[ Data — Bot Cloaking ]

X.com returns 404 to Google. Netflix shows bots 5.7x more text. We scanned ten of the top 25 most-visited sites.

X.com returns a 404 to Google. That is not a typo. If you visit https://x.com in a Chrome browser you get the normal homepage. If you fetch the same URL with a Googlebot user agent you get an HTTP 404 “not found” with zero content. X is actively preventing Google from indexing the X homepage, and the same rule almost certainly applies to GPTBot and ClaudeBot. Ask ChatGPT to summarize what is on X.com and you are asking a model that was shown a blank page.

Published April 11, 2026 · 8 min read · Based on live /api/ai-readiness scans of 10 top-25 websites

X is not the only one doing this. We pointed the ZeroKit.dev AI Readiness scanner at ten of the most-visited domains on the internet on April 10, 2026 and recorded every case where the server returned a materially different response to Googlebot than it did to a regular browser. Five of the ten triggered a severe or moderate cloaking rating. Three different mechanisms, three different motives, one identical side effect: the machines that summarize the web for a growing share of search traffic are being shown a doctored version of the homepage.

The stakes are not academic. Perplexity handled over a billion queries last quarter. Google AI Overviews now surface above the blue links on roughly a third of all US searches. ChatGPT search is the default answer layer for millions of Plus subscribers. When one of those systems decides to cite a source, the citation is lifted from whatever the crawler was shown. If the crawler was shown a 404, the citation quietly does not happen. If the crawler was shown 5.7x more text than the browser, the citation is pulled from material a human reader will never see. Either way the publisher is flying blind.

What cloaking is and why it matters for AI

Cloaking is the practice of serving a different response to a search engine crawler than to a regular browser. It is older than modern SEO. In the 1990s black-hat operators stuffed bot-only pages with keywords to manipulate rankings, and Google's guidelines have classified cloaking as a webspam violation ever since. The canonical test is simple: fetch the same URL with a browser user agent and with Googlebot/2.1, compare the responses, and flag anything that is not explained by normal dynamic content. Differences in status code, body length, word count, title, or canonical tag are the usual tells.
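That two-fetch comparison is easy to reproduce. The sketch below, in Python's standard library, is a minimal version of the test, not the ZeroKit scanner itself; the [0.5, 2.0] word-count band mirrors the benign range discussed later in this post, and a full scanner would also compare title, canonical tag, and body hash.

```python
import urllib.error
import urllib.request

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
CHROME = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36")

def fetch(url, user_agent):
    """Fetch url with the given User-Agent; return (status, body_text)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:  # 4xx/5xx responses still carry a body
        return err.code, err.read().decode("utf-8", errors="replace")

def diff_signals(bot, browser):
    """Compare two (status, body) pairs and list the obvious cloaking tells."""
    signals = []
    if bot[0] != browser[0]:
        signals.append(f"status: bot {bot[0]} vs browser {browser[0]}")
    bot_words, browser_words = len(bot[1].split()), len(browser[1].split())
    ratio = bot_words / max(browser_words, 1)
    if not 0.5 <= ratio <= 2.0:  # outside the benign dynamic-content band
        signals.append(f"words: bot {bot_words} vs browser {browser_words} ({ratio:.2f}x)")
    # A fuller scanner would also diff title, canonical tag, and content hash.
    return signals
```

Calling `diff_signals(fetch("https://x.com", GOOGLEBOT), fetch("https://x.com", CHROME))` from a cloud IP should reproduce the status mismatch this audit reports.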

In 2026 the reason to care is different. Cloaking used to be about tricking rankings. Now it is about what the machines that summarize the web are allowed to see. GPTBot, ClaudeBot, PerplexityBot, and Google-Extended are crawler user agents that behave almost identically to Googlebot at the network layer. Most cloaking rules at the CDN and WAF level do not distinguish between them. A rule written to block or throttle Googlebot will usually apply to every other bot user agent too, either because the rule was written permissively or because it was inherited from a bot-detection vendor that lumps crawlers together. The result is that any site cloaking to Google is probably cloaking to the AI crawlers, and the AI crawlers are the ones training the models and feeding the citations.

The five cases with verified data

# Domain Severity Signals Primary evidence
1 x.com severe 10 Googlebot: HTTP 404. Browser: HTTP 200.
2 whatsapp.com severe 12 Googlebot: HTTP 200. Browser: HTTP 400.
3 unity3d.com severe 12 Googlebot: HTTP 403. Browser: HTTP 200.
4 microsoft.com severe 6 Googlebot word count 741 vs browser 305. 2.43x ratio.
5 netflix.com moderate 6 Googlebot word count 17,778 vs browser 3,117. 5.70x ratio.

X.com — severe, 10 signals

The raw signal reads googlebot received 404 while browser got 200. In plain English: when our scanner fetches https://x.com with a Googlebot user agent from our Hetzner cloud IP, the server returns HTTP 404 with a body under 4 KB. When the scanner refetches the same URL from the same IP with a Chrome user agent, it gets a normal HTTP 200 and 244 KB of homepage.

One additional rigor check after publication: fetching https://x.com as Googlebot from a residential IP (a Mac on a consumer ISP) returns 200, not 404. Same URL, same user agent, different source IP, different response. That means X is running IP-aware cloaking on top of UA-aware cloaking: residential Googlebot fetches get the real homepage, cloud-IP Googlebot fetches get a 404. This is actually a stronger story, not a weaker one. Every major AI training crawler runs from cloud IPs. GPTBot, ClaudeBot, PerplexityBot, Google-Extended all originate from data-center ranges, not residential broadband. The version of X.com they see is the 404, not the 200.

We also observed GPTBot specifically receiving an HTTP 402 "Payment Required" from x.com — different failure mode, same end result. If ChatGPT writes about X.com it is citing a page that either does not exist or requires payment. The model was trained on whatever the crawler was shown.

WhatsApp.com — severe, 12 signals, inverted

WhatsApp is the strangest of the five. The cloak runs the wrong way. Googlebot receives HTTP 200 with a full response. The browser fetch gets HTTP 400 Bad Request. Twelve separate signals flagged across status code, body length, and content hash. The most likely explanation is a browser-side feature detection step that rejects the scanner's Chrome fingerprint, or a WAF rule that treats non-bot traffic without a valid session cookie as suspicious. Whichever it is, the side effect is that Googlebot sees a cleaner version of WhatsApp's homepage than a first-time visitor does. This is benign for search ranking and weird for every other reason. We flagged it because twelve signals is twelve signals.

Unity3D.com — severe, 12 signals

Unity serves Googlebot an HTTP 403 Forbidden while the browser gets a normal HTTP 200. Twelve signals across the usual axes. Unlike X this reads more like an anti-bot WAF rule that was written without a Googlebot allowlist exception. The practical result is the same: Google cannot index the homepage, every major AI crawler that identifies itself is probably getting the same 403, and Unity's top landing page is effectively dark to the answer engines. For a company that sells to game developers who routinely ask Perplexity and ChatGPT for engine comparisons, that is revenue loss in slow motion.

Netflix.com — moderate, 6 signals, 5.70x ratio

Netflix is the most instructive of the five because the status code is the same on both sides. Browser and bot both get HTTP 200. The difference is in the body. Googlebot receives 17,778 words. The browser receives 3,117 words. That is a 5.70x ratio, well outside the [0.5, 2.0] band the scanner treats as benign dynamic content. Either the browser version is gated behind a JavaScript curtain that the scanner cannot execute, or Netflix is deliberately serving bots a flattened static version with every title and description embedded for indexing purposes. This is the kind of cloak that was historically considered acceptable, more content for bots rather than less, but Google's current guidance explicitly forbids it. At 5.70x this is not subtle.
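The ratio arithmetic is simple enough to check by hand. A short sketch using the same [0.5, 2.0] benign band described in this post:

```python
def word_ratio(bot_words: int, browser_words: int) -> float:
    """Bot-to-browser word-count ratio used to flag content-volume cloaks."""
    return bot_words / max(browser_words, 1)

def is_benign(ratio: float, band=(0.5, 2.0)) -> bool:
    """Ratios inside the band are treated as normal dynamic variation."""
    return band[0] <= ratio <= band[1]

netflix = word_ratio(17778, 3117)       # the Netflix numbers from this scan
print(round(netflix, 2))                # -> 5.7
print(is_benign(netflix))               # -> False
print(is_benign(word_ratio(741, 305)))  # Microsoft's 2.43x is also outside the band
```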

Microsoft.com — severe, 6 signals, 2.43x ratio

Microsoft's homepage serves Googlebot 741 words against 305 in the browser, a 2.43x ratio. Less dramatic than Netflix, but still flagged as severe because six independent signals crossed the threshold: word count, body length, title drift, hash divergence, and two header-level differences. The word-count delta is most likely an SSR vs client-hydration artifact: the bot gets the fully rendered SSR page, the browser gets a minimal shell that hydrates later. That is a defensible engineering choice, but it still means the AI crawler that trained on microsoft.com saw a substantially different document than a Chrome user sees today.

The wider scan: five more domains

On top of the verified five we ran the same scan against apple.com, linkedin.com, youtube.com, azure.com, and snapchat.com to get a sense of how common the pattern is. The summary is below.

# Domain Severity Notable signal
6 apple.com none No cloaking detected. Browser and bot receive equivalent content.
7 azure.com none No cloaking detected. Clean scan.
8 linkedin.com minor Hash differs for Googlebot, GPTBot, and ClaudeBot but length within 5%. Consistent with session-scoped dynamic content.
9 youtube.com minor Same pattern as LinkedIn. Hash drift without length drift across three bot agents.
10 snapchat.com moderate GPTBot body is 197% of browser size — an OpenAI-specific cloak of roughly 2x content inflation.

Apple and Azure came back clean: no cloaking signals at any severity. LinkedIn and YouTube showed minor hash drift across three different bot user agents — Googlebot, GPTBot, and ClaudeBot — without a meaningful length difference. That pattern is consistent with session-scoped dynamic content (personalization, A/B test bucket assignment) rather than deliberate cloaking. We leave them on the watchlist but not in the severe column.
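The distinction between hash drift and length drift is mechanical. A hedged sketch of how such a classifier might look; the 5% length tolerance matches the "length within 5%" note in the table above, but the labels are ours, not the scanner's internal taxonomy:

```python
import hashlib

def classify_drift(bot_body: str, browser_body: str, tolerance: float = 0.05) -> str:
    """Separate hash-only drift (session content) from real content divergence."""
    hash_differs = (hashlib.sha256(bot_body.encode()).digest()
                    != hashlib.sha256(browser_body.encode()).digest())
    longer = max(len(bot_body), len(browser_body), 1)
    length_close = abs(len(bot_body) - len(browser_body)) / longer <= tolerance
    if not hash_differs:
        return "identical"
    if length_close:
        return "hash-drift"    # LinkedIn/YouTube pattern: likely session-scoped
    return "length-drift"      # bodies genuinely diverge: look closer
```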

Snapchat is the outlier in the second set and the quiet story of the whole audit. The scanner flagged GPTBot-specific inflation: when fetched with the GPTBot user agent, snapchat.com's body is 197% the size of the browser response. That is an OpenAI-specific cloak. Whether Snap is doing this deliberately to feed more content to ChatGPT's training pipeline or whether a CDN rule is rewriting the response for GPTBot specifically, we cannot tell from the outside. Either way it is the first example in this audit of a cloak keyed to an AI crawler user agent rather than to Googlebot.

Why this matters for AI citations

The assumption underneath every Generative Engine Optimization playbook is that whatever a crawler fetches is a reasonable proxy for what a human visitor sees. That assumption is broken. On two of the ten sites we scanned, a modern AI crawler identifying itself honestly would receive a hard 403 or 404; on a third the cloak runs in reverse, and a first-time browser gets an HTTP 400 while the bot gets a clean 200. On two more the crawler would receive a document substantially longer than what a human browser loads, and on one more an OpenAI-specific inflation. Out of ten top-25 sites, six behave differently depending on who is asking.

The immediate consequence for publishers is attribution loss. If ChatGPT cannot load your homepage it cannot cite it, and ChatGPT citations are already driving measurable referral traffic to the sites that get them. The longer-term consequence is brand risk. Models trained on a 404 version of X.com will describe X.com as a page that does not exist. Models trained on the static-shell version of Microsoft.com will describe Microsoft using content Microsoft no longer considers canonical. Neither outcome is what the brand owner wanted, but both are what they are currently getting.

The quote-worthy version: six out of ten top-25 websites ship a different homepage to bots than to browsers. Three of them ship a hard error to one side of the divide. Every AI model that cites those domains is citing material that humans cannot see.

How to check your own site

The scan behind this post runs on our public API. You can hit it with curl in one line:

curl "https://zerokit.dev/api/ai-readiness?url=https://yoursite.com&extended=1"

The response includes a cloaking object with three fields that matter: cloaking_detected is a boolean, cloaking_severity is one of none, minor, moderate, or severe, and cloaking_signals is an array of human-readable strings describing each flagged difference. If the severity comes back at severe or moderate the first place to look is CDN or WAF rules keyed on user agent, followed by any server-side rendering middleware that branches on req.headers['user-agent'].
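In a script, the same check is a few lines. The sketch below assumes the cloaking object lives under a top-level `cloaking` key in the JSON response, which matches the description above but is worth verifying against a live call:

```python
import json
import urllib.parse
import urllib.request

def scan(url: str) -> dict:
    """Call the public AI Readiness endpoint and return its cloaking object."""
    qs = urllib.parse.urlencode({"url": url, "extended": 1})
    endpoint = f"https://zerokit.dev/api/ai-readiness?{qs}"
    with urllib.request.urlopen(endpoint, timeout=30) as resp:
        return json.load(resp)["cloaking"]  # assumed top-level key

def triage(cloaking: dict) -> list:
    """Surface the flagged signals only when severity warrants a look."""
    if cloaking["cloaking_severity"] in ("moderate", "severe"):
        return cloaking["cloaking_signals"]
    return []
```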

Run the extended scan on your own site

Free. Live HTTP requests. No account required. Returns the same data this audit is based on.


FAQ

What is bot cloaking and how is it different from a bot block?

A bot block is an honest refusal. The server returns 403 or obeys robots.txt and the crawler moves on. Cloaking is quieter: the server detects the bot, returns a 200, and ships a different response than it would to a browser. Different status code, different page, or same page with a different amount of content. Search engines treat cloaking as a ranking violation, but detection at scale is hard because the differences are often subtle.

Why does this matter for AI citations from ChatGPT, Claude, and Perplexity?

AI answer engines crawl with their own user agents. GPTBot, ClaudeBot, PerplexityBot. Cloaking rules at the CDN layer are usually keyed on user agent strings, and a rule that targets Googlebot will almost always catch the others too. If the cloaked version is broken the AI crawler learns a broken version of the site. Any citation produced later points at content no human will ever see.

Did you verify these results manually?

The scanner compares live HTTP responses. For each domain it fetches the homepage with a Googlebot user agent and with a normal Chrome user agent, then compares status code, body length, word count, title, and content hash. The signals in this audit are the raw outputs of that comparison on 2026-04-10. Cloaking detection is heuristic and can produce false positives, but a severe rating with 10 or more independent signals is unlikely to be coincidental.

How can I check my own site for cloaking?

Use the extended flag on the public AI Readiness endpoint: /api/ai-readiness?url=https://yoursite.com&extended=1. The response includes a cloaking object with cloaking_detected, cloaking_severity, and a list of cloaking_signals. If you see a severe rating review your CDN rules, WAF policies, and any server-side user agent branching you may have inherited from an anti-abuse vendor.

Methodology: each domain was fetched on 2026-04-10 via the public ZeroKit.dev /api/ai-readiness endpoint with extended=1, running from a Hetzner cloud IP in Germany. The scanner compared real HTTP responses fetched with Googlebot/2.1, GPTBot/1.0, ClaudeBot/1.0, and Chrome 125 user agents, and computed signals across status code, body length, word count, title, and content hash. Severity ratings (none / minor / moderate / severe) are derived from the number and type of signals that crossed internal thresholds. Results may vary by source IP: x.com, in particular, returned 200 to a Googlebot user agent from a residential IP and 404 from a cloud IP in our own cross-check. Cloaking detection is heuristic and can produce false positives when a site legitimately varies content by geography, A/B test bucket, session state, or source IP. The raw scan JSON and the reproducibility dataset are available at the public endpoint; we also saved the verified 10-domain harvest in our internal cloaking-data.json.