Complete List of AI Crawlers in 2026 (With User-Agent Strings)
There are over 20 AI crawlers actively hitting websites right now. Some collect training data for language models. Others fetch pages in real-time when users ask AI assistants questions. A few do both.
Here's every known AI crawler as of April 2026, with the exact user-agent strings you need to block or allow them in your robots.txt. We're keeping this list updated as new crawlers appear.
The Complete AI Crawler Table
| User-Agent | Company | Purpose | Type |
|---|---|---|---|
| GPTBot | OpenAI | Training data for GPT models | Training |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | Search |
| OAI-SearchBot | OpenAI | SearchGPT / ChatGPT search feature | Search |
| ClaudeBot | Anthropic | Training data for Claude models | Training |
| anthropic-ai | Anthropic | Legacy Anthropic crawler | Training |
| Claude-Web | Anthropic | Real-time web access for Claude | Search |
| Google-Extended | Google | Training data for Gemini AI | Training |
| GoogleOther | Google | Non-search crawling (R&D, AI) | Both |
| PerplexityBot | Perplexity AI | Real-time search answers + training | Both |
| Bytespider | ByteDance | Training data for ByteDance AI (TikTok) | Training |
| CCBot | Common Crawl | Open web archive used by many AI companies | Training |
| FacebookBot | Meta | Training data for Llama models | Training |
| Meta-ExternalAgent | Meta | Meta AI assistant web browsing | Both |
| Meta-ExternalFetcher | Meta | Real-time content fetching for Meta AI | Search |
| Applebot-Extended | Apple | Training data for Apple Intelligence | Training |
| Amazonbot | Amazon | Alexa answers + Amazon AI training | Both |
| cohere-ai | Cohere | Training data for Cohere language models | Training |
| Timesbot | Various / Undisclosed | Content scraping for AI training | Training |
| YouBot | You.com | AI search engine indexing | Search |
| Diffbot | Diffbot | Web data extraction for AI knowledge graphs | Training |
| Omgilibot | Webz.io | Web data collection for AI datasets | Training |
| PetalBot | Huawei | Petal Search + AI training | Both |
The Big Five: Crawlers You Must Know
1. GPTBot (OpenAI)
The one that started the AI crawler conversation. GPTBot collects content that feeds into GPT-4, GPT-5, and future OpenAI models. It's been active since August 2023 and is one of the most aggressive AI crawlers by volume.
User-agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
IP ranges: Published at openai.com/gptbot-ranges.txt
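Since any client can claim to be GPTBot in its user-agent header, the published ranges let you verify the source IP. Here's a minimal sketch using Python's standard ipaddress module. The CIDR below is an illustrative placeholder (an RFC 5737 documentation block), not a real OpenAI range; in practice, load the current list from OpenAI's published file:

```python
import ipaddress

# Illustrative placeholder range (RFC 5737 documentation block).
# Load the real ranges from OpenAI's published file instead of
# hard-coding them -- they change over time.
GPTBOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

def is_gptbot_ip(client_ip: str) -> bool:
    """True if client_ip falls inside one of the known GPTBot ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in GPTBOT_RANGES)
```

The same pattern works for any crawler that publishes its IP ranges: a user-agent match plus a range check gives you much stronger verification than the header alone.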
OpenAI also runs ChatGPT-User for real-time browsing and OAI-SearchBot for their SearchGPT feature. These are separate crawlers -- blocking GPTBot doesn't affect them.
2. ClaudeBot (Anthropic)
Anthropic's crawler for Claude model training. It replaced the older anthropic-ai user-agent in 2024. ClaudeBot is known for respecting Crawl-delay directives, which most other AI crawlers ignore.
User-agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +https://www.anthropic.com/claudebot)
Claude-Web is the real-time browsing agent, used when Claude needs to fetch a page during a conversation.
3. Google-Extended (Google)
This is the one that confuses people. Google-Extended is NOT Googlebot. Blocking it doesn't affect your search rankings at all. It's a separate crawler that collects data for Gemini and other Google AI products.
User-agent string: Google-Extended
Google also uses GoogleOther for non-search crawling, including AI research. If you want to be thorough, block both.
4. Bytespider (ByteDance)
ByteDance's crawler is one of the most aggressive by sheer crawl volume. It collects training data for ByteDance's AI products, including those behind TikTok. It's notorious for high request rates and doesn't always respect Crawl-delay.
User-agent string:
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Bytespider; spider-feedback@bytedance.com) Chrome/51.0.2704.103 Safari/537.36
5. CCBot (Common Crawl)
Common Crawl is a non-profit that builds an open web archive. The catch: their dataset is widely used by AI companies for training, including by companies that don't run their own crawlers. Blocking CCBot reduces your exposure across multiple AI training pipelines.
User-agent string:
CCBot/2.0 (https://commoncrawl.org/faq/)
Training vs. Search Crawlers
This distinction matters because you might want to block one type but not the other:
Training crawlers collect content that gets baked into AI models permanently. Once your content is in the training data, there's no removing it. Blocking these prevents future training on your content.
- GPTBot, ClaudeBot, Google-Extended, Bytespider, CCBot, FacebookBot, Applebot-Extended, cohere-ai, Diffbot
Search crawlers fetch pages in real-time when users ask questions. It's like Google fetching a page for search results -- temporary, not stored for training. Blocking these means AI assistants can't cite or link to your content in conversations.
- ChatGPT-User, OAI-SearchBot, Claude-Web, YouBot, Meta-ExternalFetcher
Dual-purpose crawlers do both:
- PerplexityBot, Amazonbot, Meta-ExternalAgent, GoogleOther, PetalBot
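For tagging log lines or building tooling around this distinction, the three-way split can be expressed as a lookup table. A sketch mirroring the classification above (the mapping will need updating as crawlers appear or change purpose):

```python
# Maps each crawler token to its type, mirroring the lists above.
CRAWLER_TYPES = {
    "GPTBot": "training", "ClaudeBot": "training", "anthropic-ai": "training",
    "Google-Extended": "training", "Bytespider": "training", "CCBot": "training",
    "FacebookBot": "training", "Applebot-Extended": "training",
    "cohere-ai": "training", "Diffbot": "training",
    "ChatGPT-User": "search", "OAI-SearchBot": "search", "Claude-Web": "search",
    "YouBot": "search", "Meta-ExternalFetcher": "search",
    "PerplexityBot": "both", "Amazonbot": "both", "Meta-ExternalAgent": "both",
    "GoogleOther": "both", "PetalBot": "both",
}

def classify(user_agent: str) -> str:
    """Return 'training', 'search', 'both', or 'unknown' for a UA string."""
    for token, kind in CRAWLER_TYPES.items():
        if token.lower() in user_agent.lower():
            return kind
    return "unknown"
```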
The "Block Everything" robots.txt
If you want to block every known AI crawler, here's the complete robots.txt block:
# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
# Google AI
User-agent: Google-Extended
Disallow: /
User-agent: GoogleOther
Disallow: /
# Perplexity
User-agent: PerplexityBot
Disallow: /
# ByteDance
User-agent: Bytespider
Disallow: /
# Common Crawl
User-agent: CCBot
Disallow: /
# Meta
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
# Apple
User-agent: Applebot-Extended
Disallow: /
# Amazon
User-agent: Amazonbot
Disallow: /
# Cohere
User-agent: cohere-ai
Disallow: /
# Others
User-agent: Timesbot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: PetalBot
Disallow: /
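If you maintain the crawler list in code, a block like the one above can be generated rather than hand-edited. A minimal sketch (the list mirrors this article's table; the per-company comments are omitted for brevity):

```python
# All known AI crawlers, mirroring the table in this article.
# Keep this list in sync as new crawlers appear.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "anthropic-ai", "Claude-Web",
    "Google-Extended", "GoogleOther",
    "PerplexityBot", "Bytespider", "CCBot",
    "FacebookBot", "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Applebot-Extended", "Amazonbot", "cohere-ai",
    "Timesbot", "YouBot", "Diffbot", "Omgilibot", "PetalBot",
]

def robots_block(agents):
    """Return robots.txt rules disallowing every agent in the list."""
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
    return "\n".join(lines) + "\n"
```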
Generate this robots.txt with one click
Our Robots.txt Generator has presets for blocking all AI crawlers, training-only crawlers, or custom selections.
Open Robots.txt Generator
The "Smart Block" Approach
A more strategic approach: block training crawlers but allow search crawlers. This way, AI assistants can still cite your site in real-time answers (free traffic), but your content won't be used for model training.
# BLOCK Training Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
# ALLOW Search/Browsing Crawlers
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Meta-ExternalFetcher
Allow: /
This is the approach we'd recommend for most content publishers. You protect your content from being absorbed into training data while still appearing in AI-powered search results.
How to Detect AI Crawlers in Your Logs
Want to see which AI crawlers are actually hitting your site? Check your access logs:
Nginx
# Find all AI crawler hits in your access log (add a date filter to narrow the window)
grep -E "GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot|Google-Extended|FacebookBot|Amazonbot|cohere-ai|Applebot-Extended" /var/log/nginx/access.log
Apache
grep -E "GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot" /var/log/apache2/access.log
Count requests per crawler
grep -oE "GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot|Google-Extended|FacebookBot" /var/log/nginx/access.log | sort | uniq -c | sort -rn
This tells you exactly which crawlers are most active on your site and helps you prioritize which ones to block first.
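If you'd rather script the tally than chain greps, the same count is a few lines of Python. A sketch that searches each log line for the crawler tokens used in the greps above (extend the pattern as new crawlers appear):

```python
import re
from collections import Counter

# Same tokens as the grep examples above; extend as needed.
AI_UA = re.compile(
    r"GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot|"
    r"Google-Extended|FacebookBot|Amazonbot|cohere-ai|Applebot-Extended"
)

def count_ai_hits(log_path):
    """Count requests per AI crawler in an access log."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            match = AI_UA.search(line)
            if match:
                counts[match.group(0)] += 1
    return counts
```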
Beyond robots.txt: Server-Level Blocking
robots.txt is voluntary. For stronger protection, block AI crawlers at the server level:
Nginx
# Add to your server block
if ($http_user_agent ~* "(GPTBot|ClaudeBot|Bytespider|CCBot|anthropic-ai|Google-Extended|FacebookBot|Meta-ExternalAgent)") {
    return 403;
}
Apache (.htaccess)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot|anthropic-ai|Google-Extended) [NC]
RewriteRule .* - [F,L]
Cloudflare
If you're using Cloudflare, go to Security → WAF → Custom Rules and create a rule that blocks requests where the user-agent contains AI crawler strings. Cloudflare also has a built-in "AI Bots" category in their Bot Management settings.
Check which AI crawlers can reach your site
Our AI Readiness Checker scans your robots.txt and reports which of these crawlers are blocked or allowed.
Run AI Readiness Check
Keeping Up with New Crawlers
New AI crawlers show up every few months. Here's how to stay ahead:
- Monitor your logs: Set up a monthly check for unfamiliar user-agents with high request volumes
- Watch the Dark Visitors list: darkvisitors.com maintains a community-driven list of AI crawlers
- Use our checker: The AI Readiness Checker is updated as new crawlers are identified
- Follow OpenAI and Anthropic docs: Major AI companies publish their crawler documentation, including new user-agents
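The monthly log check in the first bullet can be automated. Here's a sketch that flags bot-like user-agents with high request counts that aren't on your known list. It assumes a combined log format where the user-agent is the last quoted field, and the names in KNOWN are just starting points to extend:

```python
import re
from collections import Counter

# Starting list of user-agents you already recognize; extend it.
KNOWN = {"Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"}

def unfamiliar_bots(log_path, threshold=100):
    """Flag bot-like UAs with >= threshold requests that aren't in KNOWN."""
    counts = Counter()
    ua_re = re.compile(r'"([^"]*)"\s*$')  # UA is usually the last quoted field
    with open(log_path) as f:
        for line in f:
            match = ua_re.search(line)
            if match and "bot" in match.group(1).lower():
                counts[match.group(1)] += 1
    return {ua: n for ua, n in counts.items()
            if n >= threshold
            and not any(k.lower() in ua.lower() for k in KNOWN)}
```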
We'll update this page as new crawlers are discovered. Bookmark it or check back monthly.
Frequently Asked Questions
How many AI crawlers are there in 2026?
As of April 2026, there are at least 20 known AI crawlers actively crawling the web. The major ones are GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google/Gemini), PerplexityBot (Perplexity AI), and Bytespider (ByteDance). New ones appear regularly as more companies build AI products that need web data.
Do all AI crawlers respect robots.txt?
Most reputable AI crawlers from major companies (OpenAI, Anthropic, Google, Perplexity, Apple, Amazon) respect robots.txt rules. However, some crawlers, especially those from smaller or less transparent companies, may not fully comply. For stronger protection, you can combine robots.txt with server-level IP blocking or HTTP header-based restrictions.
What's the difference between AI training crawlers and AI search crawlers?
AI training crawlers (like GPTBot, Google-Extended, CCBot) collect content to train language models; once trained, your content is baked into the model permanently. AI search crawlers (like ChatGPT-User, OAI-SearchBot, Claude-Web) fetch pages in real-time when users ask questions, similar to how Google fetches pages for search results. Some crawlers, like PerplexityBot and Amazonbot, do both.
Should I block all AI crawlers or just some?
It depends on your goals. If you want maximum content protection, block all AI crawlers. If you want your site to appear in AI-powered search results (ChatGPT browsing, Perplexity answers), you might want to allow search-focused crawlers while blocking training-only crawlers. A common middle ground: block all training crawlers but allow ChatGPT-User and PerplexityBot for real-time search visibility.
How do I check which AI crawlers are visiting my site?
Check your server access logs for AI crawler user-agent strings. Look for GPTBot, ClaudeBot, PerplexityBot, Bytespider, and others in your log files. Alternatively, use the ZeroKit.dev AI Readiness Checker to scan your robots.txt and see which crawlers are currently blocked or allowed on your site.