Complete List of AI Crawlers in 2026 (With User-Agent Strings)
There are over 20 AI crawlers actively hitting websites right now. Some collect training data for language models. Others fetch pages in real-time when users ask AI assistants questions. A few do both.
Here's every known AI crawler as of April 2026, with the exact user-agent strings you need to block or allow them in your robots.txt. We're keeping this list updated as new crawlers appear.
The Complete AI Crawler Table
| User-Agent | Company | Purpose | Type |
|---|---|---|---|
| GPTBot | OpenAI | Training data for GPT models | Training |
| ChatGPT-User | OpenAI | Real-time browsing in ChatGPT | Search |
| OAI-SearchBot | OpenAI | SearchGPT / ChatGPT search feature | Search |
| ClaudeBot | Anthropic | Training data for Claude models | Training |
| anthropic-ai | Anthropic | Legacy Anthropic crawler | Training |
| Claude-Web | Anthropic | Real-time web access for Claude | Search |
| Google-Extended | Google | Training data for Gemini AI | Training |
| GoogleOther | Google | Non-search crawling (R&D, AI) | Both |
| PerplexityBot | Perplexity AI | Real-time search answers + training | Both |
| Bytespider | ByteDance | Training data for ByteDance AI (TikTok) | Training |
| CCBot | Common Crawl | Open web archive used by many AI companies | Training |
| FacebookBot | Meta | Training data for Llama models | Training |
| Meta-ExternalAgent | Meta | Meta AI assistant web browsing | Both |
| Meta-ExternalFetcher | Meta | Real-time content fetching for Meta AI | Search |
| Applebot-Extended | Apple | Training data for Apple Intelligence | Training |
| Amazonbot | Amazon | Alexa answers + Amazon AI training | Both |
| cohere-ai | Cohere | Training data for Cohere language models | Training |
| Timesbot | Various / Undisclosed | Content scraping for AI training | Training |
| YouBot | You.com | AI search engine indexing | Search |
| Diffbot | Diffbot | Web data extraction for AI knowledge graphs | Training |
| Omgilibot | Webz.io | Web data collection for AI datasets | Training |
| PetalBot | Huawei | Petal Search + AI training | Both |
The Big Five: Crawlers You Must Know
1. GPTBot (OpenAI)
The one that started the AI crawler conversation. GPTBot collects content that feeds into GPT-4, GPT-5, and future OpenAI models. It's been active since August 2023 and is one of the most aggressive AI crawlers by volume.
User-agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
IP ranges: Published at openai.com/gptbot-ranges.txt
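Since any client can claim to be GPTBot in its user-agent header, the published ranges let you verify the source IP. Here's a minimal sketch using Python's standard ipaddress module. The CIDR below is an illustrative placeholder (an RFC 5737 documentation block), not a real OpenAI range; in practice, load the current list from OpenAI's published file:

```python
import ipaddress

# Illustrative placeholder range (RFC 5737 documentation block).
# Load the real ranges from OpenAI's published file instead of
# hard-coding them -- they change over time.
GPTBOT_RANGES = [ipaddress.ip_network("192.0.2.0/24")]

def is_gptbot_ip(client_ip: str) -> bool:
    """True if client_ip falls inside one of the known GPTBot ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in GPTBOT_RANGES)
```

The same pattern works for any crawler that publishes its IP ranges: a user-agent match plus a range check gives you much stronger verification than the header alone.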
OpenAI also runs ChatGPT-User for real-time browsing and OAI-SearchBot for their SearchGPT feature. These are separate crawlers -- blocking GPTBot doesn't affect them.
2. ClaudeBot (Anthropic)
Anthropic's crawler for Claude model training. It replaced the older anthropic-ai user-agent in 2024. ClaudeBot is known for respecting Crawl-delay directives, which most other AI crawlers ignore.
User-agent string:
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +https://www.anthropic.com/claudebot)
Claude-Web is the real-time browsing agent, used when Claude needs to fetch a page during a conversation.
3. Google-Extended (Google)
This is the one that confuses people. Google-Extended is NOT Googlebot. Blocking it doesn't affect your search rankings at all. It's a separate crawler that collects data for Gemini and other Google AI products.
User-agent string: Google-Extended
Google also uses GoogleOther for non-search crawling, including AI research. If you want to be thorough, block both.
4. Bytespider (ByteDance)
ByteDance's crawler is one of the most aggressive by sheer crawl volume. It collects training data for ByteDance's AI products, including those behind TikTok. It's notorious for high request rates and doesn't always respect Crawl-delay.
User-agent string:
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko; compatible; Bytespider; spider-feedback@bytedance.com) Chrome/51.0.2704.103 Safari/537.36
5. CCBot (Common Crawl)
Common Crawl is a non-profit that builds an open web archive. The catch: their dataset is widely used by AI companies for training, including by companies that don't run their own crawlers. Blocking CCBot reduces your exposure across multiple AI training pipelines.
User-agent string:
CCBot/2.0 (https://commoncrawl.org/faq/)
Training vs. Search Crawlers
This distinction matters because you might want to block one type but not the other:
Training crawlers collect content that gets baked into AI models permanently. Once your content is in the training data, there's no removing it. Blocking these prevents future training on your content.
- GPTBot, ClaudeBot, Google-Extended, Bytespider, CCBot, FacebookBot, Applebot-Extended, cohere-ai, Diffbot
Search crawlers fetch pages in real-time when users ask questions. It's like Google fetching a page for search results -- temporary, not stored for training. Blocking these means AI assistants can't cite or link to your content in conversations.
- ChatGPT-User, OAI-SearchBot, Claude-Web, YouBot, Meta-ExternalFetcher
Dual-purpose crawlers do both:
- PerplexityBot, Amazonbot, Meta-ExternalAgent, GoogleOther, PetalBot
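For tagging log lines or building tooling around this distinction, the three-way split can be expressed as a lookup table. A sketch mirroring the classification above (the mapping will need updating as crawlers appear or change purpose):

```python
# Maps each crawler token to its type, mirroring the lists above.
CRAWLER_TYPES = {
    "GPTBot": "training", "ClaudeBot": "training", "anthropic-ai": "training",
    "Google-Extended": "training", "Bytespider": "training", "CCBot": "training",
    "FacebookBot": "training", "Applebot-Extended": "training",
    "cohere-ai": "training", "Diffbot": "training",
    "ChatGPT-User": "search", "OAI-SearchBot": "search", "Claude-Web": "search",
    "YouBot": "search", "Meta-ExternalFetcher": "search",
    "PerplexityBot": "both", "Amazonbot": "both", "Meta-ExternalAgent": "both",
    "GoogleOther": "both", "PetalBot": "both",
}

def classify(user_agent: str) -> str:
    """Return 'training', 'search', 'both', or 'unknown' for a UA string."""
    for token, kind in CRAWLER_TYPES.items():
        if token.lower() in user_agent.lower():
            return kind
    return "unknown"
```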
The "Block Everything" robots.txt
If you want to block every known AI crawler, here's the complete robots.txt block:
# OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
# Anthropic
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
# Google AI
User-agent: Google-Extended
Disallow: /
User-agent: GoogleOther
Disallow: /
# Perplexity
User-agent: PerplexityBot
Disallow: /
# ByteDance
User-agent: Bytespider
Disallow: /
# Common Crawl
User-agent: CCBot
Disallow: /
# Meta
User-agent: FacebookBot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
# Apple
User-agent: Applebot-Extended
Disallow: /
# Amazon
User-agent: Amazonbot
Disallow: /
# Cohere
User-agent: cohere-ai
Disallow: /
# Others
User-agent: Timesbot
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: PetalBot
Disallow: /
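If you maintain the crawler list in code, a block like the one above can be generated rather than hand-edited. A minimal sketch (the list mirrors this article's table; the per-company comments are omitted for brevity):

```python
# All known AI crawlers, mirroring the table in this article.
# Keep this list in sync as new crawlers appear.
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "anthropic-ai", "Claude-Web",
    "Google-Extended", "GoogleOther",
    "PerplexityBot", "Bytespider", "CCBot",
    "FacebookBot", "Meta-ExternalAgent", "Meta-ExternalFetcher",
    "Applebot-Extended", "Amazonbot", "cohere-ai",
    "Timesbot", "YouBot", "Diffbot", "Omgilibot", "PetalBot",
]

def robots_block(agents):
    """Return robots.txt rules disallowing every agent in the list."""
    lines = []
    for agent in agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
    return "\n".join(lines) + "\n"
```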
Generate this robots.txt with one click
Our Robots.txt Generator has presets for blocking all AI crawlers, training-only crawlers, or custom selections.
Open Robots.txt Generator
The "Smart Block" Approach
A more strategic approach: block training crawlers but allow search crawlers. This way, AI assistants can still cite your site in real-time answers (free traffic), but your content won't be used for model training.
# BLOCK Training Crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
# ALLOW Search/Browsing Crawlers
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Meta-ExternalFetcher
Allow: /
This is the approach we'd recommend for most content publishers. You protect your content from being absorbed into training data while still appearing in AI-powered search results.
How to Detect AI Crawlers in Your Logs
Want to see which AI crawlers are actually hitting your site? Check your access logs:
Nginx
# Find all AI crawler hits in your access log (add a date filter to narrow the window)
grep -E "GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot|Google-Extended|FacebookBot|Amazonbot|cohere-ai|Applebot-Extended" /var/log/nginx/access.log
Apache
grep -E "GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot" /var/log/apache2/access.log
Count requests per crawler
grep -oE "GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot|Google-Extended|FacebookBot" /var/log/nginx/access.log | sort | uniq -c | sort -rn
This tells you exactly which crawlers are most active on your site and helps you prioritize which ones to block first.
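If you'd rather script the tally than chain greps, the same count is a few lines of Python. A sketch that searches each log line for the crawler tokens used in the greps above (extend the pattern as new crawlers appear):

```python
import re
from collections import Counter

# Same tokens as the grep examples above; extend as needed.
AI_UA = re.compile(
    r"GPTBot|ClaudeBot|Bytespider|CCBot|PerplexityBot|"
    r"Google-Extended|FacebookBot|Amazonbot|cohere-ai|Applebot-Extended"
)

def count_ai_hits(log_path):
    """Count requests per AI crawler in an access log."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            match = AI_UA.search(line)
            if match:
                counts[match.group(0)] += 1
    return counts
```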
Beyond robots.txt: Server-Level Blocking
robots.txt is voluntary. For stronger protection, block AI crawlers at the server level:
Nginx
# Add to your server block
if ($http_user_agent ~* "(GPTBot|ClaudeBot|Bytespider|CCBot|anthropic-ai|Google-Extended|FacebookBot|Meta-ExternalAgent)") {
    return 403;
}
Apache (.htaccess)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot|anthropic-ai|Google-Extended) [NC]
RewriteRule .* - [F,L]
Cloudflare
If you're using Cloudflare, go to Security → WAF → Custom Rules and create a rule that blocks requests where the user-agent contains AI crawler strings. Cloudflare also has a built-in "AI Bots" category in their Bot Management settings.
Check which AI crawlers can reach your site
Our AI Readiness Checker scans your robots.txt and reports which of these crawlers are blocked or allowed.
Run AI Readiness Check
Keeping Up with New Crawlers
New AI crawlers show up every few months. Here's how to stay ahead:
- Monitor your logs: Set up a monthly check for unfamiliar user-agents with high request volumes
- Watch the Dark Visitors list: darkvisitors.com maintains a community-driven list of AI crawlers
- Use our checker: The AI Readiness Checker is updated as new crawlers are identified
- Follow OpenAI and Anthropic docs: Major AI companies publish their crawler documentation, including new user-agents
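The monthly log check in the first bullet can be automated. Here's a sketch that flags bot-like user-agents with high request counts that aren't on your known list. It assumes a combined log format where the user-agent is the last quoted field, and the names in KNOWN are just starting points to extend:

```python
import re
from collections import Counter

# Starting list of user-agents you already recognize; extend it.
KNOWN = {"Googlebot", "Bingbot", "GPTBot", "ClaudeBot", "CCBot", "PerplexityBot"}

def unfamiliar_bots(log_path, threshold=100):
    """Flag bot-like UAs with >= threshold requests that aren't in KNOWN."""
    counts = Counter()
    ua_re = re.compile(r'"([^"]*)"\s*$')  # UA is usually the last quoted field
    with open(log_path) as f:
        for line in f:
            match = ua_re.search(line)
            if match and "bot" in match.group(1).lower():
                counts[match.group(1)] += 1
    return {ua: n for ua, n in counts.items()
            if n >= threshold
            and not any(k.lower() in ua.lower() for k in KNOWN)}
```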
We'll update this page as new crawlers are discovered. Bookmark it or check back monthly.
Frequently Asked Questions
How many AI crawlers are there in 2026?
As of April 2026, there are at least 20 known AI crawlers actively crawling the web. The major ones are GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google/Gemini), PerplexityBot (Perplexity AI), and Bytespider (ByteDance). New ones appear regularly as more companies build AI products that need web data.
Do all AI crawlers respect robots.txt?
Most reputable AI crawlers from major companies (OpenAI, Anthropic, Google, Perplexity, Apple, Amazon) respect robots.txt rules. However, some crawlers, especially those from smaller or less transparent companies, may not fully comply. For stronger protection, you can combine robots.txt with server-level IP blocking or HTTP header-based restrictions.
What's the difference between AI training crawlers and AI search crawlers?
AI training crawlers (like GPTBot, Google-Extended, CCBot) collect content to train language models; once trained, your content is baked into the model permanently. AI search crawlers (like ChatGPT-User, OAI-SearchBot, Claude-Web) fetch pages in real-time when users ask questions, similar to how Google fetches pages for search results. Some crawlers, like PerplexityBot and Amazonbot, do both.
Should I block all AI crawlers or just some?
It depends on your goals. If you want maximum content protection, block all AI crawlers. If you want your site to appear in AI-powered search results (ChatGPT browsing, Perplexity answers), you might want to allow search-focused crawlers while blocking training-only crawlers. A common middle ground: block all training crawlers but allow ChatGPT-User and PerplexityBot for real-time search visibility.
How do I check which AI crawlers are visiting my site?
Check your server access logs for AI crawler user-agent strings. Look for GPTBot, ClaudeBot, PerplexityBot, Bytespider, and others in your log files. Alternatively, use the ZeroKit.dev AI Readiness Checker to scan your robots.txt and see which crawlers are currently blocked or allowed on your site.