How to Block All AI Crawlers in robots.txt
There are over a dozen AI crawlers roaming the web right now, each collecting content for different AI companies. Blocking one isn't enough if the others are still helping themselves to your content. Here's the complete list and the exact robots.txt rules to block them all.
The Complete AI Crawler List (2026)
Every known AI crawler that respects robots.txt, who operates it, and what it does:
| User-Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for GPT models |
| ChatGPT-User | OpenAI | Real-time web browsing in ChatGPT |
| ClaudeBot | Anthropic | Training data for Claude models |
| anthropic-ai | Anthropic | Research and safety evaluation |
| Google-Extended | Google | Gemini AI training (not search) |
| PerplexityBot | Perplexity | AI-powered search answers |
| CCBot | Common Crawl | Open dataset used by many AI labs |
| Bytespider | ByteDance | Training data for ByteDance AI |
| meta-externalagent | Meta | Training data for Meta AI / Llama |
| Applebot-Extended | Apple | Apple Intelligence training |
| cohere-ai | Cohere | Training data for Cohere models |
| Diffbot | Diffbot | Web data extraction / knowledge graph |
| Omgilibot | Webz.io | Web data for AI training sets |
| FacebookExternalHit | Meta | Link preview + potential AI training |
The Copy-Paste Block (All AI Crawlers)
Add this to your robots.txt. It blocks every AI training crawler in the table above while keeping search engines fully allowed. (FacebookExternalHit is deliberately left unblocked: it also fetches link previews when your pages are shared on Facebook, Messenger, and other Meta products, so blocking it breaks those previews.)
```
# ==========================================
# AI Crawlers: BLOCKED
# ==========================================
# OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Google AI (not search)
User-agent: Google-Extended
Disallow: /

# Perplexity
User-agent: PerplexityBot
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Meta
User-agent: meta-externalagent
Disallow: /

# Apple
User-agent: Applebot-Extended
Disallow: /

# Cohere
User-agent: cohere-ai
Disallow: /

# Diffbot
User-agent: Diffbot
Disallow: /

# Webz.io
User-agent: Omgilibot
Disallow: /

# ==========================================
# Search Engines: ALLOWED
# ==========================================
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
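Before relying on the file, it's worth confirming the rules parse the way you expect. A minimal sketch using Python's built-in urllib.robotparser against a subset of the rules above (yourdomain.com is a placeholder):

```python
from urllib import robotparser

# A subset of the rules above, as they would be served from /robots.txt.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

url = "https://yourdomain.com/some-article"
for agent in ("GPTBot", "CCBot", "Googlebot"):
    print(agent, rp.can_fetch(agent, url))
# GPTBot False, CCBot False, Googlebot True
```

To test the live file instead of a local string, call rp.set_url("https://yourdomain.com/robots.txt") followed by rp.read() before the can_fetch checks.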
Check if your robots.txt is configured correctly
Our AI Readiness Checker scans for all 14 AI crawlers and shows exactly which ones are blocked or allowed.
Run AI Readiness Check
Understanding the Trade-offs
Blocking all AI crawlers is the maximum-protection option. But it's not free -- here's what you're giving up:
- AI visibility -- ChatGPT, Claude, and Perplexity won't reference your content accurately when users ask about topics in your space.
- AI search traffic -- Perplexity and similar AI search tools won't link to your pages in their answers.
- Future AI products -- As AI becomes more integrated into how people find information, being absent from training data means being absent from answers.
What you keep:
- Content control -- Your work isn't used to train models without your consent.
- Search rankings -- Google and Bing rankings are completely unaffected.
- Server resources -- Several AI crawlers are aggressive. Blocking them reduces unnecessary server load.
Why You Can't Use a Single Wildcard Rule
A common question: "Can't I just add one rule to block all AI bots?"
No. robots.txt has no "AI crawler" category. A wildcard User-agent: * rule with Disallow: / blocks everything, including search engines. Each AI crawler identifies itself with a unique user-agent string, so each one needs its own rule.
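The point is easy to verify with Python's built-in urllib.robotparser -- a minimal sketch showing that the blanket wildcard takes search engines down along with the AI bots (yourdomain.com is a placeholder):

```python
from urllib import robotparser

# The tempting one-rule shortcut: block everything.
blanket = ["User-agent: *", "Disallow: /"]

rp = robotparser.RobotFileParser()
rp.parse(blanket)

# The wildcard catches AI crawlers and search engines alike.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/"))     # False
print(rp.can_fetch("Googlebot", "https://yourdomain.com/"))  # False
```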
That's why using a robots.txt generator with AI presets saves time. One click generates all the rules correctly.
The CCBot Back Door
CCBot deserves special attention. Common Crawl maintains the largest public web dataset, and many AI companies (including some that don't run their own crawlers) use Common Crawl data for training. Blocking CCBot closes a back door: even if you block GPTBot directly, OpenAI could still reach your content through Common Crawl's dataset if CCBot were allowed.
Note: Common Crawl data that was already collected before your block remains in their archive. The block only prevents future crawls.
Beyond robots.txt: Additional Protection
robots.txt is the minimum. For stronger protection:
- HTTP headers -- Some crawlers respect X-Robots-Tag: noai or similar headers, though there's no universal standard yet.
- IP blocking -- Major AI companies publish their crawler IP ranges. You can block these at the firewall level for hard enforcement.
- Rate limiting -- If you want to allow some crawling but limit volume, implement rate limiting at the server level.
- TDM Reservation Protocol -- The EU's TDM (Text and Data Mining) protocol, expressed via a tdmrep.json file, provides a legal framework for content usage policies.
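For the hard-enforcement route, a server-level block returns an error regardless of whether the crawler honors robots.txt. A minimal nginx sketch matching a few of the user-agent strings from the table above (the bot list and the noai header value are illustrative, not an exhaustive or standardized configuration):

```nginx
# In the http {} context: flag requests whose User-Agent
# matches a known AI crawler (case-insensitive).
map $http_user_agent $ai_crawler {
    default          0;
    ~*GPTBot         1;
    ~*ClaudeBot      1;
    ~*CCBot          1;
    ~*Bytespider     1;
    ~*PerplexityBot  1;
}

server {
    listen 80;
    server_name yourdomain.com;

    # Advisory opt-out header; there is no universal standard yet.
    add_header X-Robots-Tag "noai" always;

    # Hard enforcement: refuse flagged crawlers outright.
    if ($ai_crawler) {
        return 403;
    }
}
```

Note that this only matches crawlers that identify themselves honestly; combine it with IP-range blocking for bots that spoof their user-agent.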
Keeping the List Updated
New AI crawlers appear regularly. The list above is current as of April 2026, but it will grow. Strategies to stay current:
- Monitor your server access logs for unfamiliar bot user-agents
- Check our AI Readiness Checker periodically -- we update the crawler list as new ones appear
- Follow announcements from major AI companies about new crawlers
Generate your robots.txt with AI bot presets
One-click block-all preset generates every rule above. No manual typing needed.
Open Robots.txt Generator
Frequently Asked Questions
How many AI crawlers are there in 2026?
As of April 2026, there are at least 14 known AI crawlers that respect robots.txt: GPTBot and ChatGPT-User (OpenAI), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Google/Gemini), PerplexityBot (Perplexity), CCBot (Common Crawl), Bytespider (ByteDance), meta-externalagent and FacebookExternalHit (Meta), Applebot-Extended (Apple), cohere-ai (Cohere), Diffbot, and Omgilibot (Webz.io). New crawlers appear regularly as more companies build AI products.
Can I block all AI crawlers with a single rule?
No. Each AI crawler has its own user-agent string, and robots.txt requires a separate rule for each one. A wildcard User-agent: * rule with Disallow: / would block all crawlers, including search engines like Google and Bing, which you don't want. You need to list each AI crawler individually. Using a robots.txt generator with AI presets is the fastest way to do this correctly.
Will blocking AI crawlers affect my SEO or search rankings?
No. AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) are completely separate from search engine crawlers (Googlebot, Bingbot). Blocking AI crawlers has zero impact on your Google or Bing search rankings. The only exception is Google-Extended, but even that is separate from Googlebot -- blocking it only affects Gemini AI training, not search indexing.