Research Report

State of AI Crawlers 2026: We Scanned the Top 100 Websites

Everyone talks about how to prepare your site for AI. Almost nobody shows what the biggest sites on the open web actually do. We ran a multi-signal audit against the top 100 content websites in April 2026. Here is what the data says.

Published April 10, 2026 · Data window: April 10, 2026 · Sample: 100 sites · Original research

Key Findings

  • 31% of the top 100 have an llms.txt file. The standard is barely adopted at the top.
  • 21% have explicit rules for GPTBot in robots.txt (block or allow). The rest default to "allowed".
  • 10% block ClaudeBot, 12% block PerplexityBot.
  • 60% serve materially different content to at least one AI bot vs a browser.
  • 54% score "high" or "very high" on LLM knowability (Wikipedia + Common Crawl + DDG signals).
  • 30/100 is the average AI Readiness score across the sample.

Why We Did This

Every SEO blog and AI-governance newsletter gives you advice on preparing your site for AI crawlers. The advice is usually speculative: “you should block GPTBot,” “you should add an llms.txt,” “you should enforce structured data.” Nobody publishes the baseline — what actually happens at scale on the open web right now.

So we built a multi-signal scanner that combines five measurements and ran it against 100 of the largest content-producing websites. This post is the output: the numbers, the surprises, and the complete raw dataset.

  • 100 sites successfully scanned
  • 30 average core AI readiness score
  • 31 sites with an llms.txt file
  • 60 sites with AI bot cloaking detected

Finding 1: llms.txt Is Not Adopted at the Top

The llms.txt standard was proposed in late 2024, and by mid-2025 it was everywhere in SEO newsletters. One year later, among the top 100 sites, the adoption rate is 31% — 31 sites with a valid, reachable llms.txt file at the root of their domain. That is far lower than most SEO blogs implied throughout 2025.

What this means: If you add an llms.txt to your site today, you are still in the minority — even at the top of the web, fewer than a third of sites have one. Not because llms.txt is controversial, but because most of the top sites have not heard of it, have not prioritized it, or see no reason to bother when their content is already in every training set via Common Crawl.
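A minimal sketch of what "valid, reachable" can mean for a fetched /llms.txt response. The report does not publish its exact validity criteria; the heuristics here (200 status, text content type, Markdown body opening with an H1 title, as the proposed spec describes) are illustrative assumptions, and notably catch the common soft-404 case where a server returns an HTML error page with status 200:

```python
def looks_like_valid_llms_txt(status: int, content_type: str, body: str) -> bool:
    """Heuristic validity check for a fetched /llms.txt response.

    Criteria are illustrative, not the report's exact rules:
    a 200 status, a text content type, and a non-empty Markdown
    body whose first non-blank line is an H1 title.
    """
    if status != 200:
        return False
    if "text/" not in content_type.lower():
        return False
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    # Soft 404s (HTML error pages served with 200) fail the H1 check.
    return bool(lines) and lines[0].startswith("# ")
```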

Finding 2: AI Bot Rules Are Rare, and When They Exist, They Are Selective

We checked each site’s robots.txt for explicit rules (allow or disallow) for ten AI bots: GPTBot, ChatGPT-User, ClaudeBot, Claude-Web, Google-Extended, Bytespider, CCBot, FacebookBot, PerplexityBot, and Applebot-Extended.
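A sketch of the explicit-rule extraction, under simplifying assumptions: robots.txt groups are one or more User-agent lines followed by rule lines, "block" means the group contains a bare `Disallow: /`, and any other explicit rule counts as "allow". The real scanner may handle wildcards and precedence differently:

```python
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
           "Google-Extended", "Bytespider", "CCBot", "FacebookBot",
           "PerplexityBot", "Applebot-Extended"]

def explicit_ai_rules(robots_txt: str) -> dict:
    """Map each AI bot to 'block', 'allow', or None (no explicit rule)."""
    rules = {bot: None for bot in AI_BOTS}
    current_agents: list = []
    seen_rule = False  # whether the current group already has rule lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:  # a rule line ended the previous group
                current_agents, seen_rule = [], False
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            seen_rule = True
            for bot in AI_BOTS:
                if any(a.lower() == bot.lower() for a in current_agents):
                    if field == "disallow" and value == "/":
                        rules[bot] = "block"   # full block wins
                    elif rules[bot] is None:
                        rules[bot] = "allow"   # any other explicit rule
    return rules
```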

AI Bots Explicitly Blocked Among Top 100

Percentage of top 100 sites that have an explicit Disallow: / rule for each bot.
  • Google-Extended: 13% (13)
  • GPTBot: 12% (12)
  • PerplexityBot: 12% (12)
  • Bytespider: 11% (11)
  • CCBot: 11% (11)
  • ClaudeBot: 10% (10)
  • ChatGPT-User: 8% (8)
  • Claude-Web: 7% (7)
  • Applebot-Extended: 7% (7)
  • FacebookBot: 6% (6)

Explicit blocks are concentrated on Google-Extended, GPTBot, and PerplexityBot, in that order (the latter two tied at 12%). Even the most-blocked bot, Google-Extended, is blocked by only 13% of the sample. The other AI crawlers are ignored almost entirely.

What this means: If you are picking which AI bot to block first, the top 100 is not a reliable guide. They mostly don’t block anyone. When they do, Google’s Google-Extended is the most commonly targeted, narrowly ahead of OpenAI’s GPTBot.

Finding 3: Bot Cloaking Is Real, and Often Unintentional

We fetched each site’s homepage four times: once with a normal browser user-agent, once as Googlebot, once as GPTBot, and once as ClaudeBot. If the responses differed materially (different status code, drastically different body length, different title, or word count ratio outside 0.5–2.0), we flagged it as cloaking.
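The comparison step can be sketched as a pure function over per-fetch snapshots, mirroring the criteria above (the fetch itself and the body-length check are omitted; the `Snapshot` type and field names are ours, not the scanner's):

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    """One homepage fetch, reduced to the fields we compare."""
    status: int
    title: str
    word_count: int

def is_cloaked(browser: Snapshot, bot: Snapshot) -> bool:
    """Flag a material difference between browser and bot responses:
    status mismatch, title mismatch, or a bot/browser word-count
    ratio outside 0.5-2.0."""
    if browser.status != bot.status:
        return True
    if browser.title != bot.title:
        return True
    if browser.word_count and not (0.5 <= bot.word_count / browser.word_count <= 2.0):
        return True
    return False
```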

Results:

Cloaking Severity Distribution

How many sites serve different content to AI bots vs browsers.
  • No cloaking: 40% (40)
  • Minor: 25% (25)
  • Moderate: 6% (6)
  • Severe: 29% (29)

60 of 100 sites served materially different content to at least one AI bot. 6 of them explicitly blocked an AI bot (403/429) while serving a normal 200 to browsers. Most cases fall into two categories: (1) anti-spoofing — the site verifies bot identity by IP and rejects fake bot user-agents, and (2) personalization — the site runs A/B tests, geo-redirects, or user-specific rendering that produces slightly different bodies on each fetch.

A Specific Case: Wikipedia

Wikipedia returned 403 Forbidden to our spoofed Googlebot and spoofed ClaudeBot user-agents, while serving the normal page (2,300 words, 230 KB) to the browser and to GPTBot. This is not cloaking in the malicious sense. Wikipedia is verifying bot identity by IP — if you claim to be Googlebot but your request doesn’t come from a known Google IP range, Wikipedia refuses you. The same applies to ClaudeBot. Our spoofed GPTBot request went through, which suggests Wikipedia does not (or not yet) apply the same IP verification to the GPTBot user-agent.

This finding has a practical consequence: if you run a scraper or an AI agent that sends fake AI-bot user-agents from arbitrary IPs, you will get different (or no) content from Wikipedia than a browser would see. And Wikipedia is not alone.

Finding 4: Training Data Likelihood Is Almost Universal at the Top

We measured two independent signals of “how likely is this site in a major LLM’s training data”:

  1. Wayback Machine presence: how long the site has been archived, how many snapshots, whether it is still actively crawled. A proxy for Common Crawl inclusion and longitudinal content stability.
  2. Knowability signals: whether the domain is mentioned in Wikipedia search results, whether it is in the Common Crawl URL index, whether DuckDuckGo has an Instant Answer about it.
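Combining the three knowability signals into a band can be sketched as follows. The report only pins down the endpoints (all three signals = "very high", none = "minimal"); the intermediate mapping here is an illustrative guess, not the scanner's exact thresholds:

```python
def knowability_band(in_wikipedia: bool, in_common_crawl: bool,
                     has_ddg_answer: bool) -> str:
    """Band a site by how many of the three knowability signals fire.

    Only the endpoints match the report's stated definition; the
    2-signal and 1-signal mappings are assumptions for illustration.
    """
    count = sum([in_wikipedia, in_common_crawl, has_ddg_answer])
    return {3: "very high", 2: "high", 1: "medium", 0: "minimal"}[count]
```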

Out of 100 scanned sites, 54 (54%) landed in the “high” or “very high” knowability band — meaning they had Wikipedia mentions, Common Crawl presence, and/or DuckDuckGo Instant Answers.

LLM Knowability Level Distribution

Very high = Wikipedia + Common Crawl + DDG. Minimal = no signal.
  • Very high: 0% (0)
  • High: 54% (54)
  • Medium: 23% (23)
  • Low: 2% (2)
  • Minimal: 21% (21)

What this means: Being "AI-ready" in the sense of having explicit rules and a nice llms.txt is almost orthogonal to whether an LLM actually knows about you. The top sites are in training data regardless of whether they configured anything — because they have been crawlable for years. For small or new sites, the reverse is true: you can have a perfect llms.txt and still be unknown to every LLM because you are not in Common Crawl, not in Wikipedia, and not in any knowledge base.

Finding 5: Average Core AI Readiness Score Is Lower Than You’d Expect

Using our AI Readiness Checker, which scores sites on a 0–100 scale across five categories (robots.txt for AI bots 30%, llms.txt 20%, structured data 25%, content citability 15%, AI meta directives 10%), the average score among the top 100 was 30/100.
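The weighting is a straightforward weighted average. A sketch, assuming each category is itself scored 0–100 (the category key names are ours; how each sub-score is computed is not part of this report):

```python
# Category weights as stated in the report; keys are illustrative names.
WEIGHTS = {
    "robots_ai_bots": 0.30,      # robots.txt rules for AI bots
    "llms_txt": 0.20,
    "structured_data": 0.25,
    "citability": 0.15,
    "ai_meta_directives": 0.10,
}

def readiness_score(category_scores: dict) -> int:
    """Weighted 0-100 score from per-category 0-100 sub-scores.
    Missing categories score 0."""
    total = sum(WEIGHTS[k] * category_scores.get(k, 0) for k in WEIGHTS)
    return round(total)
```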

Distribution of grades:

AI Readiness Grade Distribution

How the top 100 sites grade on our multi-page scanner.
  • A+: 0% (0)
  • A: 0% (0)
  • B: 2% (2)
  • C: 7% (7)
  • D: 31% (31)
  • F: 60% (60)

0 sites (0%) scored A or A+. 91 sites (91%) scored D or F. The remaining 9 sites scored B or C. Remember: a low core score does not mean the site is absent from LLM training data — see Finding 4 for why these two things are almost orthogonal.

Methodology

Sample selection

We took the Tranco list (April 2026), filtered out infrastructure and CDN domains (gstatic, googleapis, akamai, cloudflare, fastly, akadns, amazonaws, and similar), and kept the first 100 content-producing domains. Infrastructure domains were excluded because they do not produce content that is meaningful to scan for AI readiness.
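The sample-selection step amounts to a ranked walk with a substring filter. A sketch, using the exclusion markers listed above (the report says "and similar", so the marker list here is not exhaustive):

```python
# Infrastructure/CDN markers from the report; explicitly not exhaustive.
INFRA_MARKERS = ("gstatic", "googleapis", "akamai", "cloudflare",
                 "fastly", "akadns", "amazonaws")

def first_content_domains(tranco_domains, n=100):
    """Walk the ranked Tranco list in order and keep the first n
    domains that do not match an infrastructure/CDN marker."""
    kept = []
    for domain in tranco_domains:
        if any(marker in domain for marker in INFRA_MARKERS):
            continue
        kept.append(domain)
        if len(kept) == n:
            break
    return kept
```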

Scanner

We used the ZeroKit AI Readiness Checker v2, which performs a multi-page crawl (homepage + up to 4 internal links), plus three extended signals: Wayback Machine historical analysis, knowability proxy (Wikipedia + Common Crawl + DuckDuckGo), and four-user-agent cloaking detection (browser, Googlebot, GPTBot, ClaudeBot). Full scanner source is open on GitHub.

What we did not measure

We did not query any LLM API directly to test whether the sites are “actually in” a specific model. That requires paid API access and per-model permission. Our knowability signal is a proxy. We also scanned homepages + up to 4 internal links per site, which is a sample, not a full crawl.

Limitations

Some sites blocked our scanner (403 / 429). Some timed out on the Wayback CDX query for sites with enormous archive history. We report the successful sample size in the dashboard above. Raw per-site results are in the downloadable CSV so you can see exactly what we got for each domain.

Download the Raw Data

Every number in this post is reproducible from the raw scan results. Per-site JSON, aggregated summary CSV, and the scanner source are all freely available.

Complete Dataset

All per-site results, the summary CSV, and the exact scanner version used for this report.

Download CSV · Run the Scanner on Your Site

Run It On Your Own Site

The same scanner we used for this report is free to run on any site, including yours. No signup, no credit card, no API key. It returns the same multi-page core score plus the Wayback, knowability, and cloaking signals.

Free AI Readiness Check

Runs the full v2 scanner in your browser via our public API. 10-30 seconds per site.

Check Your Site
