[ AI READINESS GLOSSARY ]

25 terms every site owner should know

Plain-English definitions of the files, bots, schemas, and concepts that determine how AI systems see your website. Each term is 150-200 words and links to the ZeroKit tool that tests it, so you can move from "what is this?" to "how do I fix it?" in one click.

AI Overviews

Concept

Google's AI-generated answer boxes that appear above the regular search results. Launched in 2024, they quote or summarise content from indexed pages and attribute each claim with a citation link. A site that scores high on the other terms in this glossary — structured data, citable content, and a clean robots.txt — appears in Overviews far more often than one that does not.

The uncomfortable truth is that Overviews reward the same things traditional SEO rewards, just with less tolerance for marketing fluff. Text that answers a question in one paragraph, with a specific number or fact, is quoted verbatim. Generic copy is skipped in favour of a competitor whose copy is specific.

Answer engines

Concept

A catch-all term for the new generation of AI-first search experiences: Perplexity, ChatGPT Search, Google AI Overviews, Bing Copilot, Claude's browsing mode, and You.com. All of them share one behaviour that traditional search engines do not: they return an answer, not a list of links. Citations are secondary, often hidden behind a small chip or footnote.

For a site owner, this shifts the optimisation target. Instead of ranking for a keyword, you're trying to be the quoted source inside a generated paragraph. That means writing text a retrieval model can lift cleanly (specific, named entities, short paragraphs) and publishing the files answer engines check first (llms.txt, robots.txt, JSON-LD).

Applebot-Extended

AI crawler

The user-agent Apple uses to gather training data for Apple Intelligence, introduced in mid-2024. It is distinct from the older Applebot which powers Siri suggestions and Spotlight. The "-Extended" variant honours a separate User-agent: section in robots.txt, which means you can allow Siri but block training if that's the policy you want.

Coverage is still modest compared to GPTBot or ClaudeBot, but it grows every quarter. If you ship to an Apple-heavy customer base, blocking Applebot-Extended removes your content from Apple Intelligence's grounding while keeping it in macOS Spotlight results — a tradeoff worth making deliberately, not by omission.

Article schema

Schema.org

The JSON-LD type used to mark up a blog post, news article, or long-form editorial. Required fields are headline, author, datePublished, and image. Optional but high-value: dateModified, publisher, mainEntityOfPage, articleBody.

Answer engines use Article schema to decide whether a page is recent enough to quote for news or time-sensitive questions. Sites that ship dateModified and actually update it when they edit appear in "recent" answer contexts disproportionately often. Our Schema Inspector flags Article blocks that are missing the dateModified field.

BreadcrumbList

Schema.org

A JSON-LD block that describes the hierarchical path from the homepage to the current page — Home → Category → Subcategory → Post. Search engines use it to render the breadcrumb trail in the result snippet, which both saves horizontal space and tells AI crawlers where a page sits in your site's structure.

BreadcrumbList is cheap: 10 lines of JSON in a <script type="application/ld+json"> tag. It is one of the signals that pushes a site from a C to a B on our rubric because it gives a retrieval model a clean way to understand related content.

Bytespider

AI crawler

ByteDance's crawler, used to train the Doubao model family and to populate TikTok's in-app search and recommendations. It became notorious in 2023 for ignoring robots.txt at scale, and honest compliance improved only after public pressure and a major Cloudflare-led rule update.

Most Western sites still choose to block Bytespider explicitly in robots.txt. Our rubric treats an explicit Disallow for Bytespider as a positive signal — not because ByteDance is uniquely harmful, but because it shows the site owner actually thought about which AI bots they want to serve and made a deliberate call.

Canonical URL

Concept

A <link rel="canonical" href="..."> tag in the <head> of a page that tells search engines and AI crawlers "the definitive URL for this content is X." It is how you deduplicate the same article served at multiple URLs (tracking parameters, session IDs, www vs non-www).

AI crawlers use the canonical aggressively when they build their retrieval index. If two pages share the same canonical, only the canonical version is likely to be quoted. A missing canonical on a page that has several URL variants costs citations across every variant. Fix rate: one line per template, permanent win.

CCBot

AI crawler

The crawler operated by Common Crawl, a non-profit that publishes a monthly snapshot of the public web. CCBot doesn't serve an AI product directly, but its corpus is the single largest input to almost every Western large language model, including GPT, Claude, Llama, and Mistral.

Blocking CCBot is unusual and has a meaningful downside: you opt out of being in the dataset that trains most foundation models. Allowing CCBot is the default on our rubric and is worth one explicit allow line in robots.txt to signal that the decision was deliberate, not accidental.

Citability

Concept

Our term for whether a retrieval model can quote a page cleanly. A citable page has specific facts, short paragraphs, named entities, and server-rendered text (not JavaScript placeholders). A non-citable page hides its content behind an SPA shell, buries the answer in marketing fluff, or renders everything as images.

Citability is the most subjective category in the ZeroKit rubric but also the highest-leverage one. Two sites can have identical robots.txt, llms.txt, and schema and still land 30 points apart because one of them writes quotable paragraphs and the other writes brand copy. If you only fix one category on your site, fix this one.

ClaudeBot

AI crawler

Anthropic's training crawler for Claude. It identifies itself as ClaudeBot in the User-Agent header and honours robots.txt at the directory level. A separate user-agent, Claude-Web, is used by the in-product web browsing tool that fetches pages in response to a user prompt rather than for training.

Our rubric treats the two user-agents independently: blocking ClaudeBot while allowing Claude-Web means the model won't train on your content but can still cite it at query time when a user asks Claude about you. That is a reasonable policy for publishers who want traffic but not corpus inclusion.

Cloaking

Concept

Serving different content to different user-agents on the same URL. Traditionally this was a black-hat SEO tactic to hide spam from Googlebot while showing it to users. In the AI era, inverse cloaking has become common: sites return a full HTML response to Googlebot but a stripped-down page or an outright 404 to GPTBot and ClaudeBot.

Some of the inverse cloaking is deliberate policy (X.com, now owned post-2022, returns 404 to many AI bots). Some is accidental — a CDN rule that singles out a user-agent and blocks it without the content team knowing. Our Cloak Detector checks what each of the four major UAs actually sees.

Crawlability

Concept

Whether a crawler can reach a page in the first place — the most basic prerequisite for everything else. A page that 404s, times out, or sits behind a login is not crawlable. A page that renders server-side HTML is highly crawlable. A page that ships an empty <div id="root"></div> and hydrates via JavaScript is crawlable by Googlebot but often not by the simpler AI bots.

Check crawlability with a plain curl — if the HTML body contains the text you see in the browser, you're crawlable. If the body is empty and the content comes from a React or Vue bundle, you're relying on every crawler to run JavaScript, which most of them still don't.

FAQPage schema

Schema.org

The JSON-LD type for a page that answers specific questions. Each question is a Question object with a name and an acceptedAnswer. Google historically used FAQPage to render "People also ask" rich snippets, and AI Overviews continue to cite pages with FAQPage markup at a disproportionate rate because the Q&A structure maps cleanly to a user prompt.

Adding FAQPage to a page you already wrote is usually 20 minutes of work: pull out three or four common user questions and copy the answer text verbatim into a Question/acceptedAnswer block. The payoff is outsized.

Google-Extended

AI crawler

Google's opt-out user-agent for Gemini and Vertex AI training. It is not a separate crawler — the same Googlebot you've always known fetches the pages, but a User-agent: Google-Extended / Disallow: / line in robots.txt signals "do not use this content to train Gemini." The page still appears in Google Search and is still fetched by Googlebot for indexing.

This split is a compromise between Google's search product and its AI product, and it lets publishers keep search visibility without feeding the model. Allowing Google-Extended explicitly (rather than by default) is the signal our rubric looks for.

GPTBot

AI crawler

OpenAI's training crawler for GPT models, launched in mid-2023. It is the most common AI user-agent in production robots.txt files today. A separate user-agent, ChatGPT-User, is used by the in-product browsing tool that fetches pages on behalf of a user's chat query.

GPTBot honours robots.txt at the directory level and respects crawl-delay. Blocking GPTBot while allowing ChatGPT-User is the most common "have your cake and eat it" policy: it removes your content from training but leaves it available for citations when a user explicitly asks ChatGPT about your site.

JSON-LD

Schema.org

The JSON-based serialisation of Schema.org, embedded in an HTML page as <script type="application/ld+json">...</script>. It is the preferred format for structured data on modern sites because it decouples the markup from the rendered HTML — you can change the layout without re-wiring the metadata.

AI crawlers parse every JSON-LD block on a page and merge them into a single entity graph. A homepage can ship an Organization block, a WebSite block with SearchAction, and a BreadcrumbList block, and the model will read all three as one coherent description of the site. Our Schema Inspector shows every JSON-LD block on a page and flags missing fields.

llms-full.txt

Core file

The long-form companion to llms.txt. While llms.txt is a curated manifest of links, llms-full.txt contains the actual prose — the articles, the documentation, the key passages — inlined as markdown in a single file. Crawlers that want to understand the site in one fetch can download llms-full.txt and skip the HTML scraping entirely.

llms-full.txt is optional in the spec and only one extra point on our rubric, but it is a powerful signal when present. A 30 KB llms-full.txt tells an answer engine everything it needs to know about a documentation site in a single HTTP request, versus crawling 50 HTML pages.

llms.txt

Core file

A plain-markdown file at the root of a site (/llms.txt) that describes what the site is about, in a format optimised for large language models. The spec (llmstxt.org) requires an H1 with the site name, a blockquote summary, and H2 sections listing key pages as markdown links with one-line descriptions.

llms.txt is the single cheapest signal to fix and the hardest for competitors to copy — not because the format is hard, but because it forces the team to actually write down what the site is for. Perplexity, Claude, and ChatGPT all look for it on first visit. Our llms.txt Validator scores a live file 0-20 against the spec.

Meta robots

Concept

The <meta name="robots" content="..."> tag in the page <head> that controls per-page indexing policy. Common values: index,follow (default), noindex, nofollow, and the newer noai and noimageai extensions that some crawlers honour.

Meta robots overrides nothing in robots.txt — they are two separate mechanisms. Robots.txt says "do not fetch this URL at all"; meta robots says "you can fetch it but do not index the content." A well-configured site uses robots.txt for path-level rules and meta robots for exceptions inside an otherwise-indexable directory.

Organization schema

Schema.org

The JSON-LD type that describes the entity behind the site. Required field is name; high-value fields are url, logo, sameAs (array of your social/wiki/external profile URLs), and contactPoint. AI crawlers merge Organization blocks with the entity graph they already have about your brand, which is how you "teach" a model that your company exists.

The sameAs array is the most under-used field on the modern web. Listing your Wikipedia, GitHub, Crunchbase, and LinkedIn URLs there gives a retrieval model four independent confirmation signals that your site is the one associated with your brand name.

PerplexityBot

AI crawler

Perplexity's live-retrieval crawler. Unlike GPTBot or ClaudeBot, PerplexityBot is not primarily a training crawler — it fetches pages in response to user queries and uses them to ground answers in real time. Blocking it has an immediate and visible cost: your site disappears from Perplexity answers in minutes.

Because PerplexityBot is so citation-focused, it rewards sites with a clean llms.txt and structured data disproportionately. A well-marked page ends up as the top citation even when larger sites are returned in the same retrieval set. Our rubric gives full credit for an explicit allow.

RAG

Concept

Retrieval-Augmented Generation. The architecture answer engines use to answer questions with current information: a retrieval step pulls relevant pages from an index, and a generation step writes the answer using those pages as context. Perplexity, Claude's web search, ChatGPT's browsing, and AI Overviews are all RAG systems.

The retrieval step is where every term in this glossary actually matters. If your page is not in the index (crawlability), if it's not parseable (structured data), if it's not quotable (citability), the generation step never sees it and you never get cited. RAG is the reason "AI readiness" is a real job in 2026 and not a 2022 buzzword.

robots.txt

Core file

A plain-text file at /robots.txt that tells crawlers which paths they are allowed to fetch. The format has existed since 1994, but the AI era added dozens of new user-agents that can be addressed individually. A modern robots.txt lists at least GPTBot, ClaudeBot, Google-Extended, PerplexityBot, Applebot-Extended, CCBot, and FacebookBot, each with an explicit Allow or Disallow.

Our rubric gives 30 points for robots.txt (the largest single category) because it is the file every AI crawler reads first and caches for hours. A missing or overly broad robots.txt is the single biggest easy fix on a typical site. The AI Readiness Checker scores your current file in one click.

Schema.org

A shared vocabulary, maintained jointly by Google, Microsoft, Yahoo, and Yandex since 2011, for describing what a page is. Types include Article, Product, Event, Organization, Person, FAQPage, Recipe, Course, SoftwareApplication, and about 800 others. The vocabulary is serialised as JSON-LD, Microdata, or RDFa — JSON-LD is the modern default.

For AI readiness, Schema.org is the bridge between your HTML and the structured entity graph an LLM relies on to answer questions accurately. A product page without Product schema is just a blob of text to a retrieval model; the same page with schema is a database row it can quote with confidence.

sitemap.xml

Core file

An XML file at /sitemap.xml that lists every public URL on the site, with optional lastmod, changefreq, and priority hints. Referenced from robots.txt via a Sitemap: line. AI crawlers read the sitemap to discover pages that aren't linked from the homepage and to prioritise recently-updated URLs.

Sitemaps are especially valuable for programmatic content — per-host analysis pages, long-tail landing pages, dynamically-generated resources — where the crawl-from-homepage path would be too slow. Generating the sitemap from the database you already have (as ZeroKit does) means it stays in sync automatically.

Use the glossary

Filter by category at the top, or click a term in the table of contents to jump to it. Every entry links back to the ZeroKit tool that tests it — so you can go from "what is citability?" to "what's my citability score?" in two clicks.

API

All the terms above are implemented as checks in the /api/ai-readiness endpoint. The response JSON returns the score for each category plus a list of specific recommendations — use it if you want to build your own dashboard or CI check.

AI Readiness Checker · llms.txt Validator · Schema Inspector · Cloak Detector · Live leaderboard

Definitions reflect the state of AI crawler practice as of 2026-04-11. Terms evolve as new bots launch and existing bots change their policies — the page is updated when the rubric changes.

25 terms every site owner should know

[ JUMP TO TERM ]

AI Overviews

Answer engines

Applebot-Extended

Article schema

BreadcrumbList

Bytespider

Canonical URL

CCBot

Citability

ClaudeBot

Cloaking

Crawlability

FAQPage schema

Google-Extended

GPTBot

JSON-LD

llms-full.txt

llms.txt

Meta robots

Organization schema

PerplexityBot

RAG

robots.txt

Schema.org

sitemap.xml

Use the glossary

API

Related