The 3 Files Every AI-Ready Website Needs in 2026: robots.txt, llms.txt, and Schema.org
Most websites are invisible to AI. Not because ChatGPT, Claude, and Perplexity cannot reach them — but because the sites never tell AI crawlers what to read, what to trust, or how to cite them. There are three files you control that decide whether an AI system picks your content over a competitor's. Two are old web standards repurposed for a new job. The third is brand new and still missing from 69% of the top 100 websites.
The three files in one sentence each
- robots.txt — controls which AI bots are allowed to crawl your site at all.
- llms.txt — gives language models a structured, link-rich summary of what your site contains.
- Schema.org JSON-LD — ships machine-readable metadata about articles, FAQs, organizations, and entities.
This is not a theory post. Last week we ran a multi-signal scan against the top 100 websites on the open web and published every number we found in our AI Readiness Leaderboard. The data behind every claim in this post comes from those scans. Most sites have zero, one, or two of these files. Almost none have all three done well.
File #1: robots.txt for AI Bots
You probably already have a robots.txt. It is the file at the root of your domain that tells crawlers which paths they are allowed to fetch. It has been part of the web since 1994, and most CMSes generate one by default. The new job is harder. You now have to decide, bot by bot, whether you want your content used to train models, used for live answers, both, or neither. These are different bots with different user agents, and the decisions are not symmetric.
Here are the ten AI bots that matter in 2026 and what each one does:
- GPTBot — OpenAI's crawler for training data. If you block it, your content will not be used to train future GPT models (but old data is already in the set).
- ChatGPT-User — OpenAI's live browsing agent. Triggered when a ChatGPT user asks a question and the model fetches your page in real time. Blocking this kills citations in ChatGPT answers.
- ClaudeBot — Anthropic's training crawler.
- Claude-Web — Anthropic's live browsing agent for Claude.ai conversations.
- Google-Extended — Google's training crawler for Gemini and Vertex AI. Independent of the regular Googlebot; blocking this does not affect Google Search.
- PerplexityBot — Perplexity's crawler for live answer construction. Blocking this means your content cannot be cited in Perplexity answers.
- CCBot — Common Crawl. Indirectly feeds nearly every large language model on the planet, including OpenAI, Anthropic, Meta, and Mistral. Blocking CCBot is the most powerful single move if your goal is to keep your content out of training sets.
- Bytespider — ByteDance's crawler. Feeds Doubao and Douyin's AI features.
- Applebot-Extended — Apple's training crawler for Apple Intelligence. Independent of the regular Applebot used for Spotlight and Siri.
- Meta-ExternalAgent — Meta's crawler for Llama training and the AI features inside Facebook, Instagram, and WhatsApp.
Here is the surprising number from our scan of the top 100 sites. The most-blocked AI bot is not GPTBot. It is not ClaudeBot. It is Google-Extended. More top-100 sites have an explicit Disallow rule for Google-Extended than for any other AI bot. People assume the news publishers and big media properties hate OpenAI more than anyone else. The data says they hate the idea of Google harvesting their content for Gemini even more — presumably because Google already takes a huge share of their traffic via Search and AI Overviews, and feeding training data on top of that feels like a bad trade.
If you want to allow every AI bot in (which is fine for most marketing sites and developer documentation), the entire file is three lines:
```
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
```
If you want a more nuanced setup — block training, allow live browsing so you still get cited — this is the pattern that works in 2026:
```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# Allow live answer engines so you still get cited
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

# Default for everything else
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```
The order of the groups matters less than people think: a bot that finds a group naming its own user agent uses that group and ignores the wildcard, so the specific rules win. What matters is being explicit. Sites with an empty or default-only robots.txt get whatever the bot operators decide today, and that can change tomorrow.
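You can verify how a given policy resolves for each bot before deploying it. Here is a sketch using Python's standard-library `urllib.robotparser`, with a condensed version of the split policy above; `SomeFutureBot` is a made-up name used to show the wildcard fallthrough:

```python
import urllib.robotparser

# Condensed "block training, allow live answers" policy from the example above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A bot with its own group uses that group; unnamed bots fall
# through to the wildcard default.
for bot in ("GPTBot", "PerplexityBot", "SomeFutureBot"):
    print(bot, parser.can_fetch(bot, "https://example.com/docs/page"))
```

Running this prints `False` for GPTBot and `True` for the other two, which is exactly the "no training, yes citations" split the policy intends.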
Free scanner. Tests all 10 major AI crawlers against your live robots.txt and shows you exactly what each one is allowed to do.
Run the AI Readiness Checker
File #2: llms.txt — the newest standard
llms.txt is the youngest of the three. It was proposed by Jeremy Howard at Answer.AI in late 2024, formalized at llmstxt.org, and is meant to give language models a fast, structured map of your site. Think of it as a sitemap.xml written for a model instead of a search crawler — a single Markdown file at the root of your domain that lists your most important pages with one-line descriptions, grouped under headers.
Here is the data point that surprises everyone. Out of the top 100 websites we scanned, only 31 have an llms.txt file. That is 31%. This is one year after the standard went viral in SEO newsletters. Most major properties either have not heard of it, do not see a reason to add it (their content is already in every training set via Common Crawl), or have not prioritized it against more visible work. If you add a good llms.txt today, you are in a tiny minority — even at the very top of the web.
The structure is intentionally simple. An H1 with the project or site name. A blockquote with a one-sentence summary. Then sections grouped under H2 headers, each containing a Markdown list of links with descriptions. Here is a minimal but realistic example:
```markdown
# Acme Docs

> Developer documentation for Acme's cloud API. REST, webhooks, and SDKs in eight languages.

## Quickstart

- [Install SDK](https://acme.example/docs/install): Package managers and CDN options.
- [Authentication](https://acme.example/docs/auth): OAuth 2.0 and API keys.
- [First request](https://acme.example/docs/first-request): A complete example in five lines.

## Reference

- [REST API](https://acme.example/docs/rest): All endpoints, parameters, and response shapes.
- [Webhooks](https://acme.example/docs/webhooks): Event types and signature verification.
- [Errors](https://acme.example/docs/errors): Status codes, retry policy, idempotency keys.

## SDKs

- [Python](https://acme.example/docs/sdk/python)
- [TypeScript](https://acme.example/docs/sdk/typescript)
- [Go](https://acme.example/docs/sdk/go)
```
What separates a useful llms.txt from a useless one comes down to a few habits:
- Keep it short and link-rich. A good llms.txt is mostly links with one-line descriptions, not paragraphs. The job is to point a model at the right page, not to be the page.
- Be honest about scope. If you sell a developer tool, do not list your blog posts about productivity. The blockquote should say what your site actually is.
- Write descriptions a model would quote. "Authentication: OAuth 2.0 and API keys" is better than "Learn how to authenticate." The first is a citable fact, the second is filler.
- Update it when you ship something. An out-of-date llms.txt is worse than no llms.txt — it tells models you have stopped paying attention.
For reference, GitHub's own llms.txt (one of the best in the wild) lists docs sections with one-line descriptions and points at the key REST and GraphQL reference pages. It is small, scannable, and exactly what a model needs to answer a question like "how do I create a pull request via the API."
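Because the format is just Markdown with a fixed shape, generating it from a page inventory is trivial. Here is a minimal sketch; the function name and the Acme data are illustrative, not part of any standard tooling:

```python
# Sketch of an llms.txt generator: turn a site name, a one-sentence summary,
# and a dict of {section: [(title, url, description), ...]} into the
# H1 / blockquote / H2-list layout the llms.txt standard expects.
def build_llms_txt(name, summary, sections):
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, pages in sections.items():
        lines.append(f"## {heading}")
        for title, url, desc in pages:
            lines.append(f"- [{title}]({url}): {desc}")
        lines.append("")
    return "\n".join(lines)

print(build_llms_txt(
    "Acme Docs",
    "Developer documentation for Acme's cloud API.",
    {"Quickstart": [
        ("Authentication", "https://acme.example/docs/auth",
         "OAuth 2.0 and API keys."),
    ]},
))
```

The real work is editorial, not mechanical: choosing which pages make the cut and writing descriptions a model would quote.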
Free generator. Crawls your sitemap, suggests sections, and outputs a clean Markdown file you can drop at the root of your domain.
Open the llms.txt Generator
File #3: Schema.org JSON-LD — what AI actually cites
Schema.org JSON-LD is the oldest of the three. Google has rewarded it since 2015 for rich results in regular search. The new job is more important than the old one. When an answer engine pulls a quote out of your page and attributes it to your site, the metadata in your JSON-LD is what determines how it gets cited — whose name appears as the author, what date is shown, which logo represents the publisher. Without that, you become "according to a website" instead of "according to your brand."
You do not need every Schema.org type. Three of them carry roughly 60% of the value for AI citations:
- Article / BlogPosting / NewsArticle — tells answer engines who wrote it, when, what the headline is, and which image to show. Mandatory on every editorial page.
- Organization — defines who you are as a publisher: the logo, the canonical URL, the social handles via sameAs. One copy on every page.
- FAQPage — the single biggest lift for AI citations. Answer engines pull FAQ entries verbatim and attribute them to the source page.
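The Article example later in this post embeds a publisher Organization inline, but the sitewide copy is usually a standalone block. A minimal sketch, with every value a placeholder you would swap for your own:

```json
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Engineering",
  "url": "https://example.com",
  "logo": "https://example.com/logo.png",
  "sameAs": [
    "https://github.com/example",
    "https://www.linkedin.com/company/example"
  ]
}
```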
Here is what a clean Article schema looks like in 2026. Drop it inside a `<script type="application/ld+json">` tag in your `<head>`:
```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Why we rewrote our search index in Rust",
  "description": "How we cut p99 query latency from 320ms to 41ms by replacing our Elasticsearch tier.",
  "image": "https://example.com/img/rust-search-hero.png",
  "datePublished": "2026-04-11",
  "dateModified": "2026-04-11",
  "author": {
    "@type": "Person",
    "name": "Marie Doerr",
    "url": "https://example.com/team/marie"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Example Engineering",
    "logo": {
      "@type": "ImageObject",
      "url": "https://example.com/logo.png"
    }
  },
  "mainEntityOfPage": "https://example.com/blog/rust-search-rewrite"
}
```
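A quick pre-publish check catches the attribution fields before an answer engine silently ignores them. This sketch is a few lines of standard-library Python; the required-field list reflects this post's checklist, not an official Schema.org requirement set:

```python
import json

# Fields answer engines lean on for attribution (per this post's checklist).
REQUIRED = ["headline", "datePublished", "author", "publisher", "mainEntityOfPage"]

def missing_article_fields(jsonld):
    """Return the attribution fields missing from a JSON-LD Article block."""
    data = json.loads(jsonld)
    if data.get("@type") not in ("Article", "BlogPosting", "NewsArticle"):
        return ["@type is not an Article variant"]
    return [field for field in REQUIRED if field not in data]

snippet = '{"@context": "https://schema.org", "@type": "Article", "headline": "Hello"}'
print(missing_article_fields(snippet))
# -> ['datePublished', 'author', 'publisher', 'mainEntityOfPage']
```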
And here is FAQPage schema with two real questions. Match the questions on the page exactly — answer engines cross-check them:
```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How long does deployment take?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A typical deploy completes in 90 seconds, including build, push, and rolling restart across three regions."
      }
    },
    {
      "@type": "Question",
      "name": "Can I roll back a failed deploy?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Yes. Every deploy creates an immutable image. Rollback is one CLI command and takes under 30 seconds."
      }
    }
  ]
}
```
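The "match the questions on the page exactly" rule is easy to automate. Here is a sketch of that cross-check; the deployment questions are the ones from the example above, and the exact-substring comparison is a deliberate simplification:

```python
import json

def unmatched_faq_questions(page_text, faq_jsonld):
    """Return FAQPage Question names that do not appear verbatim in the page text."""
    data = json.loads(faq_jsonld)
    questions = [q["name"] for q in data.get("mainEntity", [])
                 if q.get("@type") == "Question"]
    return [q for q in questions if q not in page_text]

page = "How long does deployment take? A typical deploy completes in 90 seconds."
faq = json.dumps({
    "@type": "FAQPage",
    "mainEntity": [
        {"@type": "Question", "name": "How long does deployment take?"},
        {"@type": "Question", "name": "Can I roll back a failed deploy?"},
    ],
})
print(unmatched_faq_questions(page, faq))
# -> ['Can I roll back a failed deploy?']
```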
Now the surprise from our data. We built a new Schema Inspector last week and ran it against the homepages of the same top 100 sites. The New York Times homepage, which most people would assume is a structured-data fortress, scored only 25 out of 100 on AI citation coverage. They have Organization schema. They have ItemList schema for the front-page article slots. But the homepage has no NewsArticle schema — that lives on the individual article pages, which means the homepage itself is essentially uncitable. A model that lands on nytimes.com cannot tell who published what without following an extra link. This is fixable in an afternoon, but nobody has fixed it.
Free analyzer. Pulls every schema block from your page, scores it against the AI citation checklist, and tells you exactly what is missing.
Open the Schema Inspector
How the three files work together
The three files are not redundant and they do not substitute for each other. robots.txt lets the crawler in. llms.txt tells it what your site is about. Schema.org tells it who wrote each page, when, and what kind of content it is. Without robots.txt, the bot sees nothing. Without llms.txt, the bot has to guess at structure. Without Schema.org, the bot has facts but no metadata to cite them with. You need all three or you are leaving citations on the table.
Here is the three-step plan if you want to start today:
- Run the AI Readiness Checker on your domain. It scores your robots.txt, looks for llms.txt, audits your schema, and gives you a list of specific recommendations ranked by impact.
- Missing llms.txt? Generate a starter with the llms.txt Generator. Edit the descriptions to be cite-worthy, then drop it at the root of your domain.
- Weak structured data? Run the Schema Inspector on your most important pages. Add Article on editorial content. Add FAQPage anywhere you already have a Q&A section. Add Organization once, sitewide.
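The same three checks are easy to run yourself. This sketch takes the already-fetched file bodies (or `None` for a 404) rather than making HTTP calls, so it is a minimal illustration of the audit logic, not the Checker's actual implementation; wiring it to `urllib.request` is left to the reader:

```python
def audit(robots_txt, llms_txt, page_html):
    """Return a pass/fail dict for the three AI-readiness signals."""
    return {
        "robots.txt mentions an AI bot": bool(robots_txt) and "GPTBot" in robots_txt,
        "llms.txt present": bool(llms_txt) and llms_txt.lstrip().startswith("# "),
        "JSON-LD on the page": bool(page_html) and "application/ld+json" in page_html,
    }

report = audit(
    robots_txt="User-agent: GPTBot\nDisallow: /\n",
    llms_txt=None,  # simulating a 404: no llms.txt deployed
    page_html='<script type="application/ld+json">{}</script>',
)
for check, ok in report.items():
    print(("PASS" if ok else "FAIL"), check)
```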
FAQ
Do I need all three files?
Yes. They serve different purposes. robots.txt permits or denies AI crawlers, llms.txt describes your site in a way models can parse, and Schema.org JSON-LD structures the facts on each page so answer engines can lift them. Skipping one weakens the other two.
Will llms.txt actually be used by ChatGPT?
OpenAI has not officially committed to honoring llms.txt. But Common Crawl, which feeds nearly every large language model, is archiving the file, and Anthropic has referenced the standard in its documentation. Adoption is growing faster on the model side than on the publisher side.
Does blocking GPTBot hurt my SEO?
No. Blocking AI training bots like GPTBot, ClaudeBot, or Google-Extended does not affect Googlebot or traditional search rankings. Google-Extended is a separate user agent specifically for Gemini training and is independent from the Google search crawler.
What is the fastest win for AI readiness?
Adding Article and FAQPage schema to your existing top content. It takes about an hour per template and typically produces a double-digit point improvement on AI citation coverage scores. Most sites have an Organization schema and nothing else.
Can I see how my site compares to others?
Yes. Run the AI Readiness Checker on your site, then compare against the public leaderboard of the top 100 sites. The leaderboard shows scores per signal so you can see where you sit relative to peers in your category.
One last thing
We built zerokit.dev because we wanted to see which of these files actually moved the needle. We scanned the top 100 sites, published the raw data, and built three free tools so anyone can do the same audit on their own domain in under a minute. No signup, no credit card, no API key. The tools are free. Use them.