How to Block All AI Crawlers in robots.txt
There are over a dozen AI crawlers roaming the web right now, each collecting content for different AI companies. Blocking one isn't enough if the others are still helping themselves to your content. Here's the complete list and the exact robots.txt rules to block them all.
The Complete AI Crawler List (2026)
Every known AI crawler that respects robots.txt, who operates it, and what it does:
| User-Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for GPT models |
| ChatGPT-User | OpenAI | Real-time web browsing in ChatGPT |
| ClaudeBot | Anthropic | Training data for Claude models |
| anthropic-ai | Anthropic | Research and safety evaluation |
| Google-Extended | Google | Gemini AI training (not search) |
| PerplexityBot | Perplexity | AI-powered search answers |
| CCBot | Common Crawl | Open dataset used by many AI labs |
| Bytespider | ByteDance | Training data for ByteDance AI |
| meta-externalagent | Meta | Training data for Meta AI / Llama |
| Applebot-Extended | Apple | Apple Intelligence training |
| cohere-ai | Cohere | Training data for Cohere models |
| Diffbot | Diffbot | Web data extraction / knowledge graph |
| Omgilibot | Webz.io | Web data for AI training sets |
| FacebookExternalHit | Meta | Link preview + potential AI training |
The Copy-Paste Block (All AI Crawlers)
Add this to your robots.txt. It blocks every AI training crawler in the table above while keeping search engines fully allowed. (FacebookExternalHit is deliberately left unblocked: it also fetches link previews when your pages are shared on Facebook, Messenger, and other Meta products, so blocking it breaks those previews.)
```
# ==========================================
# AI Crawlers: BLOCKED
# ==========================================
# OpenAI
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

# Anthropic
User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

# Google AI (not search)
User-agent: Google-Extended
Disallow: /

# Perplexity
User-agent: PerplexityBot
Disallow: /

# Common Crawl
User-agent: CCBot
Disallow: /

# ByteDance
User-agent: Bytespider
Disallow: /

# Meta
User-agent: meta-externalagent
Disallow: /

# Apple
User-agent: Applebot-Extended
Disallow: /

# Cohere
User-agent: cohere-ai
Disallow: /

# Diffbot
User-agent: Diffbot
Disallow: /

# Webz.io
User-agent: Omgilibot
Disallow: /

# ==========================================
# Search Engines: ALLOWED
# ==========================================
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
```
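Before relying on the file, it's worth confirming the rules parse the way you expect. A minimal sketch using Python's built-in urllib.robotparser against a subset of the rules above (yourdomain.com is a placeholder):

```python
from urllib import robotparser

# A subset of the rules above, as they would be served from /robots.txt.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

url = "https://yourdomain.com/some-article"
for agent in ("GPTBot", "CCBot", "Googlebot"):
    print(agent, rp.can_fetch(agent, url))
# GPTBot False, CCBot False, Googlebot True
```

To test the live file instead of a local string, call rp.set_url("https://yourdomain.com/robots.txt") followed by rp.read() before the can_fetch checks.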
Check if your robots.txt is configured correctly
Our AI Readiness Checker scans for all 14 AI crawlers and shows exactly which ones are blocked or allowed.
Run AI Readiness Check
Understanding the Trade-offs
Blocking all AI crawlers is the maximum-protection option. But it's not free -- here's what you're giving up:
- AI visibility -- ChatGPT, Claude, and Perplexity won't reference your content accurately when users ask about topics in your space.
- AI search traffic -- Perplexity and similar AI search tools won't link to your pages in their answers.
- Future AI products -- As AI becomes more integrated into how people find information, being absent from training data means being absent from answers.
What you keep:
- Content control -- Your work isn't used to train models without your consent.
- Search rankings -- Google and Bing rankings are completely unaffected.
- Server resources -- Several AI crawlers are aggressive. Blocking them reduces unnecessary server load.
Why You Can't Use a Single Wildcard Rule
A common question: "Can't I just add one rule to block all AI bots?"
No. robots.txt has no "AI crawler" category. A wildcard User-agent: * rule with Disallow: / blocks everything, including search engines. Each AI crawler identifies itself with a unique user-agent string, so each one needs its own rule.
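The point is easy to verify with Python's built-in urllib.robotparser -- a minimal sketch showing that the blanket wildcard takes search engines down along with the AI bots (yourdomain.com is a placeholder):

```python
from urllib import robotparser

# The tempting one-rule shortcut: block everything.
blanket = ["User-agent: *", "Disallow: /"]

rp = robotparser.RobotFileParser()
rp.parse(blanket)

# The wildcard catches AI crawlers and search engines alike.
print(rp.can_fetch("GPTBot", "https://yourdomain.com/"))     # False
print(rp.can_fetch("Googlebot", "https://yourdomain.com/"))  # False
```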
That's why using a robots.txt generator with AI presets saves time. One click generates all the rules correctly.
The CCBot Back Door
CCBot deserves special attention. Common Crawl maintains the largest public web dataset, and many AI companies (including some that don't run their own crawlers) use Common Crawl data for training. Blocking CCBot closes a back door: even if you block GPTBot directly, OpenAI could still reach your content through Common Crawl's dataset if CCBot were allowed.
Note: Common Crawl data that was already collected before your block remains in their archive. The block only prevents future crawls.
Beyond robots.txt: Additional Protection
robots.txt is the minimum. For stronger protection:
- HTTP headers -- Some crawlers respect X-Robots-Tag: noai or similar headers, though there's no universal standard yet.
- IP blocking -- Major AI companies publish their crawler IP ranges. You can block these at the firewall level for hard enforcement.
- Rate limiting -- If you want to allow some crawling but limit volume, implement rate limiting at the server level.
- TDM Reservation Protocol -- The EU's TDM (Text and Data Mining) protocol, expressed via a tdmrep.json file, provides a legal framework for content usage policies.
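For the hard-enforcement route, a server-level block returns an error regardless of whether the crawler honors robots.txt. A minimal nginx sketch matching a few of the user-agent strings from the table above (the bot list and the noai header value are illustrative, not an exhaustive or standardized configuration):

```nginx
# In the http {} context: flag requests whose User-Agent
# matches a known AI crawler (case-insensitive).
map $http_user_agent $ai_crawler {
    default          0;
    ~*GPTBot         1;
    ~*ClaudeBot      1;
    ~*CCBot          1;
    ~*Bytespider     1;
    ~*PerplexityBot  1;
}

server {
    listen 80;
    server_name yourdomain.com;

    # Advisory opt-out header; there is no universal standard yet.
    add_header X-Robots-Tag "noai" always;

    # Hard enforcement: refuse flagged crawlers outright.
    if ($ai_crawler) {
        return 403;
    }
}
```

Note that this only matches crawlers that identify themselves honestly; combine it with IP-range blocking for bots that spoof their user-agent.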
Keeping the List Updated
New AI crawlers appear regularly. The list above is current as of April 2026, but it will grow. Strategies to stay current:
- Monitor your server access logs for unfamiliar bot user-agents
- Check our AI Readiness Checker periodically -- we update the crawler list as new ones appear
- Follow announcements from major AI companies about new crawlers
Generate your robots.txt with AI bot presets
One-click block-all preset generates every rule above. No manual typing needed.
Open Robots.txt Generator
Frequently Asked Questions
How many AI crawlers are there in 2026?
As of April 2026, there are at least 14 known AI crawlers that respect robots.txt: GPTBot and ChatGPT-User (OpenAI), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Google/Gemini), PerplexityBot (Perplexity), CCBot (Common Crawl), Bytespider (ByteDance), meta-externalagent and FacebookExternalHit (Meta), Applebot-Extended (Apple), cohere-ai (Cohere), Diffbot, and Omgilibot (Webz.io). New crawlers appear regularly as more companies build AI products.
Can I block all AI crawlers with a single rule?
No. Each AI crawler has its own user-agent string, and robots.txt requires a separate rule for each one. A wildcard User-agent: * rule with Disallow: / would block all crawlers, including search engines like Google and Bing, which you don't want. You need to list each AI crawler individually. Using a robots.txt generator with AI presets is the fastest way to do this correctly.
Will blocking AI crawlers affect my SEO or search rankings?
No. AI crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) are completely separate from search engine crawlers (Googlebot, Bingbot). Blocking AI crawlers has zero impact on your Google or Bing search rankings. The only exception is Google-Extended, but even that is separate from Googlebot -- blocking it only affects Gemini AI training, not search indexing.