What an AI crawler checker actually checks
AI crawlers follow the same decades-old rulebook as search bots: the Robots Exclusion Standard. Before fetching your pages, a well-behaved crawler looks up `https://yoursite.com/robots.txt`, finds any rules that name its user-agent, and obeys the `Disallow` directives it finds. So "checking your AI crawlers" comes down to two questions: **which AI bots are named in your robots.txt, and is each one allowed or disallowed?**
A crawler is **blocked** when your file has `User-agent: <token>` followed by `Disallow: /` (or a path that covers the page in question). It's **allowed** when no matching disallow applies. The substance of the check is knowing the right tokens to look for — because each AI company runs several bots with different jobs, and they're controlled separately:
1. **Training bots.** They absorb your content into a model and send nothing back. GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's AI-training opt-out token). Many sites block these.
2. **Search / citation bots.** They index you so the engine can cite and link you in answers. OAI-SearchBot (OpenAI), Claude-SearchBot (Anthropic), PerplexityBot (Perplexity). These are the bots that make you citable, so keep them open.
3. **User-triggered bots.** They fetch one page live when a person asks the assistant about it. ChatGPT-User (OpenAI), Claude-User (Anthropic). They act on a human's request, in real time.
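The blocked/allowed rule described above can be sketched with Python's standard-library `urllib.robotparser`. This is a minimal illustration, not a full checker: the `AI_BOTS` grouping, the `is_blocked` helper and the `SAMPLE` file are assumptions made up for the example, and `robotparser` applies the first matching rule in file order, which may differ from how a given crawler resolves overlapping `Allow` and `Disallow` paths.

```python
from urllib.robotparser import RobotFileParser

# Tokens grouped by job, as in the list above (illustrative structure).
AI_BOTS = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended"],
    "search": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user-triggered": ["ChatGPT-User", "Claude-User"],
}

def is_blocked(robots_txt: str, token: str, url: str = "/") -> bool:
    """True when the given robots.txt text disallows `url` for `token`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, url)

# Hypothetical robots.txt used only for this example.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /private/
"""

for role, tokens in AI_BOTS.items():
    for token in tokens:
        state = "blocked" if is_blocked(SAMPLE, token) else "allowed"
        print(f"{role:15} {token:18} {state}")
```

With this sample, only GPTBot is blocked site-wide; OAI-SearchBot is blocked only under `/private/`, and tokens the file never names fall through to "allowed".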
The honest caveat: robots.txt is a request, not a firewall
Compliance with robots.txt is voluntary — the file states a preference; it doesn't technically block access. The difference that matters: the major, named crawlers from OpenAI, Anthropic, Google and Perplexity publicly commit to honoring it, so for them the check is meaningful. Anonymous scrapers can ignore it. That's why robots.txt is the right place to start, and your own server logs (below) are where you confirm what actually happened.
How to check your AI crawlers yourself (free, by hand)
This isn't a secret, and you don't need a tool for the basic read. Here's the exact manual method — it genuinely works:
1. **Open your robots.txt.** Go to `https://yoursite.com/robots.txt` in any browser. Every check starts here: this is the file every AI crawler reads first.
2. **Find the AI-bot tokens.** Look for lines like `User-agent: GPTBot`, `User-agent: OAI-SearchBot`, `User-agent: ClaudeBot`, `User-agent: PerplexityBot`. For each one, check whether a `Disallow:` line follows it.
3. **Read the verdict.** `Disallow: /` under a token = that crawler is **blocked** from your whole site. A `Disallow:` with a specific path blocks only that path. No matching `Disallow` = that crawler is **allowed**.
4. **Check the wildcard rule.** A `User-agent: *` block applies to any crawler that doesn't have its own named rule. A stray `Disallow: /` there can silently block AI bots you never named, which is a common accidental block.
5. **Confirm you're blocking *current* tokens.** Make sure you're using live names like `ClaudeBot`, not the deprecated `anthropic-ai` or `Claude-Web`, which Anthropic has retired. A "block" built only on dead tokens blocks nothing.
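Steps 2 to 4 above can be sketched as a small script that reports, for each token, whether the verdict comes from a named group or from the wildcard. This is a simplified sketch under stated assumptions: `parse_groups`, `verdict` and `SAMPLE` are hypothetical names, and the check only detects a whole-site `Disallow: /`. Path-scoped rules and `Allow` overrides are deliberately not evaluated.

```python
def parse_groups(robots_txt):
    """Return a list of (user_agents, rules); rules are (directive, path) pairs."""
    groups, agents, rules = [], [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if rules:  # a rule was already seen, so this line starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif key in ("allow", "disallow") and agents:
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups

def verdict(robots_txt, token):
    """Return (status, source) for a site-wide check of one token."""
    groups = parse_groups(robots_txt)
    named = [rules for agents, rules in groups if token.lower() in agents]
    wildcard = [rules for agents, rules in groups if "*" in agents]
    if named:
        source, rule_sets = "named", named
    elif wildcard:
        source, rule_sets = "wildcard", wildcard  # step 4: the fallback group
    else:
        return ("allowed", "no matching group")
    blocked = any(d == "disallow" and p == "/" for rules in rule_sets for d, p in rules)
    return ("blocked" if blocked else "allowed", source)

# Hypothetical file: a blanket wildcard block plus a few named groups.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Disallow:
"""

for token in ["GPTBot", "ClaudeBot", "OAI-SearchBot", "PerplexityBot"]:
    print(token, verdict(SAMPLE, token))
```

In this sample, PerplexityBot is never named, so it inherits the wildcard block, which is the accidental block that step 4 warns about.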
Where the manual method runs out of road
The DIY read is real, but it has honest limits: you have to know every current token (the list keeps growing), robots.txt syntax is easy to get subtly wrong on path scope, ordering and wildcards, and — most important — a raw read can't tell you whether your *strategy* is right. It shows you that a line exists; it can't tell you that blocking OAI-SearchBot just deleted you from ChatGPT search. That judgment is exactly what the audit adds.
Skip the token-by-token reading. The free AI SEO audit checks every current AI-crawler token at once, validates the robots.txt rules, and tells you in plain English which blocks are smart and which are costing you citations — no card.
Run the free crawler check
The point of the check: block training, allow citations
A crawler checker isn't just a yes/no — its real value is catching a strategy mistake. The instinct, once people learn AI bots are crawling them, is to block all of it. That's the most expensive move in this whole topic. Modern best practice is not "block all AI." It's **block the training crawlers, allow the search and citation crawlers**:
- **Block training** (`GPTBot`, `ClaudeBot`, `Google-Extended`) — these absorb your content into models with no link back. Blocking them keeps you out of training. It does **not** hurt your Google rankings, and it does **not** affect whether you can be cited.
- **Allow search / citation** (`OAI-SearchBot`, `Claude-SearchBot`, `PerplexityBot`) — these are the bots that retrieve you when the engine writes an answer. Block them and you're uncitable: you disappear from ChatGPT, Claude and Perplexity answers entirely.
- **Allow user-triggered fetches** (`ChatGPT-User`, `Claude-User`) — these fire when a real person asks the assistant about your page. Blocking them stops users from pulling your page into a chat.
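The three-part policy above can be written out as a robots.txt file. The sketch below generates one in Python and verifies it with the standard-library parser. `build_policy`, `BLOCK` and `ALLOW` are illustrative names, and giving each allowed bot its own named group with an empty `Disallow:` is one possible way to express "allowed", not the only one.

```python
from urllib.robotparser import RobotFileParser

BLOCK = ["GPTBot", "ClaudeBot", "Google-Extended"]              # training
ALLOW = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",  # search / citation
         "ChatGPT-User", "Claude-User"]                         # user-triggered

def build_policy(block=BLOCK, allow=ALLOW) -> str:
    """Build robots.txt text that blocks `block` tokens and allows `allow` tokens."""
    lines = []
    for token in block:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    for token in allow:
        # An empty Disallow means nothing is disallowed for this group.
        lines += [f"User-agent: {token}", "Disallow:", ""]
    return "\n".join(lines)

policy = build_policy()
print(policy)

# Sanity-check the generated text with the standard-library parser.
parser = RobotFileParser()
parser.parse(policy.splitlines())
for token in BLOCK + ALLOW:
    print(token, "allowed" if parser.can_fetch(token, "/") else "blocked")
```

Naming the search and user-triggered bots explicitly has a side benefit: a crawler with its own named group does not fall back to a `User-agent: *` group, so a later wildcard block would not catch them.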
The mistake a raw robots.txt read won't flag
A blanket `User-agent: *` + `Disallow: /` aimed at "AI," or a copy-pasted "block all AI bots" snippet, quietly takes out OAI-SearchBot and PerplexityBot along with the training bots. The line looks intentional. The result is that you've removed yourself from AI search — the exact visibility you were trying to protect. A checker that knows each token's job catches this; eyeballing the file usually doesn't.
And there's the Anthropic gotcha worth its own line. If your "AI block" only lists `anthropic-ai` or `Claude-Web`, you've blocked nothing current — those tokens are deprecated. Current Anthropic crawling runs under `ClaudeBot`, `Claude-User` and `Claude-SearchBot`. It's a perfect example of why "I added a block" and "the block works" are two different facts, and why a real check beats an assumption.
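A quick way to spot this gotcha is to list which Anthropic tokens a file actually names. The following is a minimal sketch: the helper names and the `SAMPLE` file are hypothetical, and the token sets are taken from the paragraph above.

```python
DEPRECATED = {"anthropic-ai", "claude-web"}
CURRENT = {"claudebot", "claude-user", "claude-searchbot"}

def named_agents(robots_txt: str) -> set:
    """Collect every User-agent value in the file, lower-cased."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents

def anthropic_token_report(robots_txt: str) -> dict:
    agents = named_agents(robots_txt)
    return {
        "deprecated_found": sorted(agents & DEPRECATED),
        "current_found": sorted(agents & CURRENT),
        # True when the file names dead tokens but none of the current ones.
        "only_deprecated": bool(agents & DEPRECATED) and not (agents & CURRENT),
    }

# Hypothetical "AI block" built only on retired tokens.
SAMPLE = """\
User-agent: anthropic-ai
User-agent: Claude-Web
Disallow: /
"""

print(anthropic_token_report(SAMPLE))
```

When `only_deprecated` is true, the file looks like it blocks Anthropic but names none of the tokens the article lists as current.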
Why AI-crawler access is now a first-order decision
This stopped being a niche concern. AI crawlers have become a dominant slice of non-human traffic, and their volume is climbing fast — which is exactly why getting your access policy right is no longer a footnote.
+305%: growth in GPTBot requests year over year. Its share of AI-crawler traffic jumped from 5% (May 2024) to 30% (May 2025), per Cloudflare network data.
As of mid-2025, Cloudflare's network data put GPTBot at roughly 7.7% and ClaudeBot at 5.4% of all search-and-AI crawler traffic, with about 80% of AI bot activity going to training. The catch that explains the "block training" instinct: the crawl-to-referral gap is brutal. Over the same period, Cloudflare measured Anthropic crawling on the order of 38,000 pages for every one visitor it referred, OpenAI around 887:1, and Perplexity around 118:1. Those figures drift quarter to quarter, but the direction is the point: training crawlers take far more than they send back, which is why many sites block them while keeping the citation bots open.
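For your own site, the crawl-to-referral ratio is simple arithmetic: crawler requests divided by referred visits over the same period. A minimal sketch follows; the function names are illustrative, and the inputs only reproduce the rounded ratios quoted above rather than any new measurement.

```python
def crawl_to_referral_ratio(crawler_requests: int, referral_visits: int):
    """Pages crawled per referred visitor; None when there were no referrals."""
    if referral_visits == 0:
        return None  # the ratio is undefined without at least one referral
    return crawler_requests / referral_visits

def format_ratio(ratio) -> str:
    """Render a ratio in the 'N:1' form used in the text."""
    return "no referrals" if ratio is None else f"{ratio:,.0f}:1"

print(format_ratio(crawl_to_referral_ratio(38_000, 1)))  # 38,000:1
print(format_ratio(crawl_to_referral_ratio(887, 1)))     # 887:1
print(format_ratio(crawl_to_referral_ratio(500, 0)))     # no referrals
```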
Blocking is now mainstream, not fringe. One industry study found roughly 25% of the top 1,000 websites block GPTBot — up from about 5% in early 2023 — and around 79% of top news sites block AI training bots, with GPTBot the single most-blocked AI crawler. The risk in all that blocking is doing it bluntly: block the training bots, fine; block the citation bots by accident, and you've opted out of AI visibility without meaning to.
The upside of getting access right
Letting the right crawlers in is the precondition; making your pages worth citing is the payoff. The peer-reviewed GEO paper (Aggarwal et al., KDD 2024) found that answer-first structure, citable statistics and clear sourcing can lift a page's visibility in AI answers by up to 40% — but only if the crawlers can read it in the first place. A companion move: publish an llms.txt file (the standard proposed by Jeremy Howard in September 2024), a curated Markdown map that hands AI engines your best content in a format their context windows can actually digest.
The honest contrast: reading robots.txt vs the free audit
The manual read is a real method and we'd rather say so than pretend the audit is magic. The difference is coverage and judgment — what the audit catches that a token-by-token read tends to miss. Here's the honest comparison, including what the free audit does *not* do.
| | Reading robots.txt by hand | SourceWatch free AI SEO audit |
|---|---|---|
| Checks current AI-crawler tokens | Only the ones you remember to look for | Yes — every current token at once |
| Catches deprecated-token mistakes (anthropic-ai / Claude-Web) | Only if you know they're dead | Yes — flags blocks that block nothing |
| Flags accidentally blocking search/citation bots | Rarely — the line looks intentional | Yes — the strategy mistake, called out |
| Validates robots.txt syntax + path scope | Up to you to get right | Yes — checks the rule, not just its presence |
| Ties access into your wider AI-SEO posture | No — it's one file | Yes — entity recognition, answer-readiness, fixes |
| Plain-English fixes | No | Yes — prioritized, no 40-page PDF |
| Inline live widget on this page | — | No — the working check is at /ai-seo-audit |
| Scope | Whatever you paste | One page free; full site on trial |
| Who actually crawled you (first-party, verified) | No — robots.txt only states intent | On trial — real AI bots + referral clicks, verified vs vendor IPs |
| Works inside Claude Code (MCP) | No | Yes (on a plan) — agent can read it and act |
In short: reading robots.txt tells you what you *declared*; the free AI SEO audit tells you whether the declaration is *correct and complete* — and what to change. Different depth, same starting question.
The deeper check: who actually crawled you
Here's the limit every robots.txt checker shares, free or paid, ours included: robots.txt is a statement of *intent*. It tells you which bots you allowed — not which bots actually showed up, and not whether the thing in your logs calling itself "GPTBot" was real or a scraper wearing its name. Closing that gap needs first-party data, and it's where SourceWatch does something a robots.txt read structurally can't.
Moat 1 — First-party AI-crawler traffic, verified, not assumed
When an AI engine reads or cites your site, its crawler hits your pages and its answers send real referral clicks. SourceWatch captures both from your own first-party data — via a one-line snippet or a Cloudflare Worker — and verifies each hit against the vendors' published IP ranges. So a spoofed user-agent pretending to be GPTBot doesn't pollute your numbers, and you see the truth your robots.txt can only hope for: which AI bots genuinely crawled you, how often, and which real visitors arrived from AI answers. That's the closed loop — you set a policy in robots.txt, then you watch, in your own AI traffic analytics, whether it's working.
Why "verified vs assumed" is the whole accuracy story
A user-agent string is just text: anyone can send "GPTBot" and lie. Checking the source IP against the vendor's published ranges is the dependable way to confirm a crawler is real. Some peer tools watch bot crawls in logs; the piece almost no one captures is the first-party AI-*referral* click, the real person who arrived from an AI answer, separated from the spoofs. That's measured ground truth, not an inference from synthetic prompts.
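The idea of checking a claimed crawler against published IP ranges can be sketched with Python's standard-library `ipaddress` module. This is an illustration only, not SourceWatch's implementation: `is_verified` is a hypothetical helper, and the CIDR blocks below are reserved documentation ranges standing in for real vendor ranges, which you would load from each vendor's published list.

```python
import ipaddress

# PLACEHOLDER ranges from the reserved documentation blocks (RFC 5737).
# These are NOT real vendor ranges; replace them with the published lists.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}

def is_verified(token: str, remote_ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """True when remote_ip falls inside a range listed for the claimed token."""
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in ranges.get(token, []))

# A request claiming to be GPTBot from inside and outside the listed range.
print(is_verified("GPTBot", "192.0.2.17"))   # True
print(is_verified("GPTBot", "203.0.113.5"))  # False: likely a spoofed user-agent
```

A token with no listed ranges is treated as unverified rather than trusted, which keeps the default on the cautious side.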
Moat 2 — It works inside Claude Code (MCP-native)
SourceWatch ships an MCP server, so your crawler-access and AI-visibility data is readable and actionable from inside Claude Code — your assistant can pull which bots reached you, where you're blocked, and the citation gaps that follow, then help you fix the robots.txt and the content in the same loop. Among self-serve tools that's effectively unique; the comparable agent stack is enterprise-only and gated behind a separate subscription and a paid ChatGPT plan.
What SourceWatch does NOT do (so you can choose with eyes open)
It doesn't generate content for you — it tells you exactly what to fix and where the gaps are, but you (or your team, or your assistant) do the writing. There's no public REST API yet; access today is via MCP, with a REST surface on the roadmap. The free audit covers one page; full-site checks and ongoing first-party crawler-traffic capture are on the trial. And no honest tool can promise a Knowledge Panel or guaranteed ROI.
Found a problem? How to fix your crawler access
A bad result is a short to-do list, not a verdict. The fixes are concrete and mostly in your control:
1. **Stop blocking the citation bots.** Confirm `OAI-SearchBot`, `Claude-SearchBot` and `PerplexityBot` are not under any `Disallow: /`, including an over-broad `User-agent: *` rule. This is the single most common silent killer of AI visibility.
2. **Set your training policy on purpose.** If you don't want to feed the models, block `GPTBot`, `ClaudeBot` and `Google-Extended` deliberately: not by accident, and not in a way that also catches the search bots.
3. **Use current tokens.** Replace any dead `anthropic-ai` / `Claude-Web` references with `ClaudeBot`, `Claude-User` and `Claude-SearchBot` so your block actually does something.
4. **Check the layer robots.txt can't show you.** An over-aggressive firewall, WAF or CDN rule can silently block AI-crawler IP ranges even when your robots.txt is perfect. If the bots still aren't reaching you, look there.
5. **Then verify in your logs.** robots.txt states intent; your first-party traffic is ground truth. Watch which AI bots actually arrive (verified, not spoofed) and whether they turn into referral clicks. That's how you know the policy is working.
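As a starting point for the log check in the last step, the sketch below counts requests whose user-agent claims to be an AI crawler. It assumes an access log in the common "combined" format, where the user-agent is the last double-quoted field; the sample lines, IPs and user-agent strings are made up for the example. These counts are unverified claims, so spoofed user-agents are still included until the source IPs are checked.

```python
import re
from collections import Counter

# Google-Extended is omitted: it is a robots.txt control token, not a
# user-agent that shows up in request logs.
AI_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
             "Claude-SearchBot", "Claude-User", "PerplexityBot"]

# Combined log format: the user-agent is the last double-quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_ai_hits(log_lines) -> Counter:
    """Count log lines per AI token, based on the claimed user-agent only."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
    return hits

# Made-up log lines (documentation IPs, simplified user-agent strings).
SAMPLE_LOG = [
    '192.0.2.17 - - [10/May/2025:12:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "ExampleUA/1.0 (compatible; GPTBot/1.0)"',
    '198.51.100.4 - - [10/May/2025:12:00:05 +0000] "GET /blog HTTP/1.1" 200 9000 "-" "ExampleUA/1.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.9 - - [10/May/2025:12:00:09 +0000] "GET / HTTP/1.1" 200 700 "-" "Mozilla/5.0 (ordinary browser)"',
    '192.0.2.18 - - [10/May/2025:12:01:00 +0000] "GET /docs HTTP/1.1" 200 3100 "-" "ExampleUA/1.0 (compatible; GPTBot/1.0)"',
]

print(count_ai_hits(SAMPLE_LOG))
```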
That last step is where measurement closes the loop, and it's what SourceWatch is built for: it tells you whether ChatGPT, Perplexity, Gemini and Claude actually cite your brand — your AI visibility and share of voice against named competitors — and pairs it with the real, verified-vs-spoofed AI-crawler and AI-referral traffic landing on your site. See how it works, browse the features, or start with the free audit to see where you stand.
See whether AI crawlers can read you — and whether you're accidentally blocking the bots that make you citable. The free audit checks every current token, validates your robots.txt, and returns your top fixes.
Run the free AI crawler check