Free Tool

Check whether AI crawlers can actually read your site.

An AI crawler checker answers one blunt question: when ChatGPT, Perplexity, Gemini and Claude send their bots to fetch your pages, does your site let them in — or is your `robots.txt` quietly turning them away? It matters because the rule is absolute: **if an AI engine's crawler can't read you, it can't cite you.** You can check this yourself by hand — we'll show you exactly how below, no gatekeeping. Or run SourceWatch's free AI SEO audit: enter one URL and it checks every current AI-crawler token at once, flags the blocks you didn't mean to set, and gives you the fixes. For the full reference on every bot, see AI crawlers explained.

TL;DR

  • **An AI crawler checker** tells you whether AI bots can fetch your pages — or whether your `robots.txt` is blocking them. Blocked means **uncitable**: you vanish from AI answers.
  • **You can check it yourself.** Open `yoursite.com/robots.txt`, look for rules targeting bot names like `GPTBot`, `OAI-SearchBot`, `ClaudeBot` and `PerplexityBot`, and see whether a `Disallow: /` follows. We give the full manual steps below.
  • **The strategic move isn't "block all AI."** It's **block training, allow search/citation.** Block `GPTBot`/`ClaudeBot` (training); keep `OAI-SearchBot`/`PerplexityBot` (the bots that make you citable) open.
  • **Two traps a raw read misses:** a blanket `Disallow: /` for everything AI also deletes you from ChatGPT and Perplexity search; and a block built only on the deprecated `anthropic-ai` / `Claude-Web` tokens blocks *nothing* current — Claude is still crawling you. An audit catches both.
  • **SourceWatch's free check is the AI SEO audit** — it tests every current crawler token at once, validates the robots.txt rules, and returns plain-English fixes. One page, point-in-time, no card.
  • **Moat:** SourceWatch also captures the AI crawlers and AI-referral clicks actually hitting your site — verified against vendor IP ranges so spoofed bots don't count. robots.txt shows what you *allowed*; this shows who *actually came*.

What an AI crawler checker actually checks

AI crawlers follow the same decades-old rulebook as search bots: the Robots Exclusion Standard (formalized in RFC 9309). Before fetching your pages, a well-behaved crawler looks up `https://yoursite.com/robots.txt`, finds the group of rules that names its user-agent (falling back to the `User-agent: *` group if none does), and obeys the `Disallow` directives in it. So "checking your AI crawlers" comes down to two questions: **which AI bots are named in your robots.txt, and is each one allowed or disallowed?**

A crawler is **blocked** when your file has `User-agent: <token>` followed by `Disallow: /` (or a path that covers the page in question). It's **allowed** when no matching `Disallow` applies, or when a more specific `Allow` rule overrides it. The substance of the check is knowing the right tokens to look for, because each AI company runs several bots with different jobs, and they're controlled separately:

  1. **Training bots.** They absorb your content into a model and send nothing back. GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's AI-training opt-out token). Most sites block these.
  2. **Search / citation bots.** They index you so the engine can cite and link you in answers. OAI-SearchBot (OpenAI), Claude-SearchBot (Anthropic), PerplexityBot (Perplexity). These are the bots that make you citable, so keep them open.
  3. **User-triggered bots.** They fetch one page live when a person asks the assistant about it. ChatGPT-User (OpenAI), Claude-User (Anthropic). They act on a human's request, in real time.
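As a minimal sketch of that allowed/blocked check, Python's standard-library `urllib.robotparser` can evaluate a robots.txt against each token. The sample file, the `AI_BOTS` grouping and the `check_access` helper are illustrative assumptions, not SourceWatch's implementation. Note also that `robotparser` applies the first matching rule in a group, which can differ from a crawler's own matching logic on more complex files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: training bots blocked sitewide,
# every other crawler only kept out of /admin/.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /admin/
"""

# Token names as listed in the article, grouped by job.
AI_BOTS = {
    "training": ["GPTBot", "ClaudeBot", "Google-Extended"],
    "search/citation": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "user-triggered": ["ChatGPT-User", "Claude-User"],
}


def check_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Map each AI token to True (allowed) or False (blocked) for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        bot: parser.can_fetch(bot, url)
        for bots in AI_BOTS.values()
        for bot in bots
    }


if __name__ == "__main__":
    access = check_access(SAMPLE_ROBOTS_TXT, "https://yoursite.com/pricing")
    for category, bots in AI_BOTS.items():
        for bot in bots:
            verdict = "allowed" if access[bot] else "blocked"
            print(f"{category:16} {bot:18} {verdict}")
```

With this sample file, the three training tokens come back blocked while the search and user-triggered tokens fall through to the wildcard group and stay allowed for `/pricing`.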

The honest caveat: robots.txt is a request, not a firewall

Compliance with robots.txt is voluntary — the file states a preference; it doesn't technically block access. The difference that matters: the major, named crawlers from OpenAI, Anthropic, Google and Perplexity publicly commit to honoring it, so for them the check is meaningful. Anonymous scrapers can ignore it. That's why robots.txt is the right place to start, and your own server logs (below) are where you confirm what actually happened.

How to check your AI crawlers yourself (free, by hand)

This isn't a secret, and you don't need a tool for the basic read. Here's the exact manual method — it genuinely works:

  1. **Open your robots.txt.** Go to `https://yoursite.com/robots.txt` in any browser. Every check starts here, because this is the file every AI crawler reads first.
  2. **Find the AI-bot tokens.** Look for lines like `User-agent: GPTBot`, `User-agent: OAI-SearchBot`, `User-agent: ClaudeBot`, `User-agent: PerplexityBot`. For each one, check whether a `Disallow:` line follows it.
  3. **Read the verdict.** `Disallow: /` under a token = that crawler is **blocked** from your whole site. A `Disallow:` with a specific path blocks only that path. No matching `Disallow`, or a `Disallow:` with nothing after it, = that crawler is **allowed**.
  4. **Check the wildcard rule.** A `User-agent: *` block applies to any crawler that doesn't have its own named rule. A stray `Disallow: /` there can silently block AI bots you never named, which is a common accidental block.
  5. **Confirm you're blocking *current* tokens.** Make sure you're using live names like `ClaudeBot`, not the deprecated `anthropic-ai` or `Claude-Web`, which Anthropic has retired. A "block" built only on dead tokens blocks nothing.
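The five steps above can be sketched as a small script. This is a rough illustration under stated assumptions: `parse_groups` and `audit` are hypothetical helpers, the token lists are just the names mentioned in this article, and the verdict only detects whole-site `Disallow: /` blocks (path-scoped rules and `Allow` overrides are not evaluated).

```python
# Token lists taken from the article; extend them as vendors add crawlers.
CURRENT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "Claude-User", "PerplexityBot", "Google-Extended",
]
DEPRECATED_TOKENS = {"anthropic-ai", "claude-web"}

# Hypothetical file: blocks GPTBot, but "blocks" Claude only via dead tokens.
SAMPLE_ROBOTS_TXT = """\
User-agent: anthropic-ai
User-agent: Claude-Web
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Collect (directive, path) rules per lowercased user-agent token."""
    groups: dict[str, list[tuple[str, str]]] = {}
    agents: list[str] = []
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a User-agent line after rules starts a new group
                agents, seen_rule = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and agents:
            seen_rule = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def audit(robots_txt: str) -> tuple[dict[str, tuple[str, str]], list[str]]:
    """Return ({token: (verdict, rule source)}, deprecated tokens found)."""
    groups = parse_groups(robots_txt)
    report = {}
    for token in CURRENT_TOKENS:
        if token.lower() in groups:      # steps 2-3: a named group wins
            rules, source = groups[token.lower()], "named"
        elif "*" in groups:              # step 4: fall back to the wildcard
            rules, source = groups["*"], "wildcard"
        else:
            rules, source = [], "no rule"
        # Sitewide check only: looks for a literal "Disallow: /".
        sitewide_block = ("disallow", "/") in rules
        report[token] = ("blocked" if sitewide_block else "allowed", source)
    dead = sorted(DEPRECATED_TOKENS.intersection(groups))  # step 5
    return report, dead


if __name__ == "__main__":
    report, dead = audit(SAMPLE_ROBOTS_TXT)
    for token, (verdict, source) in report.items():
        print(f"{token:18} {verdict:8} ({source})")
    print("Deprecated tokens present:", ", ".join(dead) or "none")
```

On the sample file, `ClaudeBot` is reported as allowed through the wildcard group even though the file appears to block Claude, which is the dead-token mistake from step 5.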

Where the manual method runs out of road

The DIY read is real, but it has honest limits: you have to know every current token (the list keeps growing), robots.txt syntax is easy to get subtly wrong on path scope, ordering and wildcards, and — most important — a raw read can't tell you whether your *strategy* is right. It shows you that a line exists; it can't tell you that blocking OAI-SearchBot just deleted you from ChatGPT search. That judgment is exactly what the audit adds.

Skip the token-by-token reading. The free AI SEO audit checks every current AI-crawler token at once, validates the robots.txt rules, and tells you in plain English which blocks are smart and which are costing you citations — no card.

Run the free crawler check

Why AI-crawler access is now a first-order decision

This stopped being a niche concern. AI crawlers have become a dominant slice of non-human traffic, and their volume is climbing fast — which is exactly why getting your access policy right is no longer a footnote.

+305%

Growth in GPTBot requests year over year — its share of AI-crawler traffic jumped from 5% (May 2024) to 30% (May 2025), per Cloudflare network data.

As of mid-2025, Cloudflare's network data put GPTBot at roughly 7.7% and ClaudeBot at 5.4% of all search-and-AI crawler traffic, with about 80% of AI bot activity going to training. The catch that explains the "block training" instinct: the crawl-to-referral gap is brutal. As of mid-2025, Cloudflare measured Anthropic crawling on the order of 38,000 pages for every one visitor it referred, OpenAI around 887:1, and Perplexity around 118:1. Those figures drift quarter to quarter, but the direction is the point — training crawlers take far more than they send back, which is why most sites block them while keeping the citation bots open.

Blocking is now mainstream, not fringe. One industry study found roughly 25% of the top 1,000 websites block GPTBot — up from about 5% in early 2023 — and around 79% of top news sites block AI training bots, with GPTBot the single most-blocked AI crawler. The risk in all that blocking is doing it bluntly: block the training bots, fine; block the citation bots by accident, and you've opted out of AI visibility without meaning to.

The upside of getting access right

Letting the right crawlers in is the precondition; making your pages worth citing is the payoff. The peer-reviewed GEO paper (Aggarwal et al., KDD 2024) found that answer-first structure, citable statistics and clear sourcing can lift a page's visibility in AI answers by up to 40%, but only if the crawlers can read it in the first place. A companion move: publish an llms.txt file (a convention proposed by Jeremy Howard in September 2024, not yet a formal standard), a curated Markdown map that hands AI engines your best content in a format their context windows can actually digest.

The honest contrast: reading robots.txt vs the free audit

The manual read is a real method and we'd rather say so than pretend the audit is magic. The difference is coverage and judgment — what the audit catches that a token-by-token read tends to miss. Here's the honest comparison, including what the free audit does *not* do.

| | Reading robots.txt by hand | SourceWatch free AI SEO audit |
|---|---|---|
| Checks current AI-crawler tokens | Only the ones you remember to look for | Yes: every current token at once |
| Catches deprecated-token mistakes (anthropic-ai / Claude-Web) | Only if you know they're dead | Yes: flags blocks that block nothing |
| Flags accidentally blocking search/citation bots | Rarely: the line looks intentional | Yes: the strategy mistake, called out |
| Validates robots.txt syntax + path scope | Up to you to get right | Yes: checks the rule, not just its presence |
| Ties access into your wider AI-SEO posture | No: it's one file | Yes: entity recognition, answer-readiness, fixes |
| Plain-English fixes | No | Yes: prioritized, no 40-page PDF |
| Inline live widget on this page | | No: the working check is at /ai-seo-audit |
| Scope | Whatever you paste | One page free; full site on trial |
| Who actually crawled you (first-party, verified) | No: robots.txt only states intent | On trial: real AI bots + referral clicks, verified vs vendor IPs |
| Works inside Claude Code (MCP) | No | Yes (on a plan): agent can read it and act |

In short: reading robots.txt tells you what you *declared*; the free AI SEO audit tells you whether the declaration is *correct and complete* — and what to change. Different depth, same starting question.

The deeper check: who actually crawled you

Here's the limit every robots.txt checker shares, free or paid, ours included: robots.txt is a statement of *intent*. It tells you which bots you allowed — not which bots actually showed up, and not whether the thing in your logs calling itself "GPTBot" was real or a scraper wearing its name. Closing that gap needs first-party data, and it's where SourceWatch does something a robots.txt read structurally can't.

Moat 1 — First-party AI-crawler traffic, verified, not assumed

When an AI engine reads or cites your site, its crawler hits your pages and its answers send real referral clicks. SourceWatch captures both from your own first-party data — via a one-line snippet or a Cloudflare Worker — and verifies each hit against the vendors' published IP ranges. So a spoofed user-agent pretending to be GPTBot doesn't pollute your numbers, and you see the truth your robots.txt can only hope for: which AI bots genuinely crawled you, how often, and which real visitors arrived from AI answers. That's the closed loop — you set a policy in robots.txt, then you watch, in your own AI traffic analytics, whether it's working.

Why "verified vs assumed" is the whole accuracy story

A user-agent string is just text — anyone can send "GPTBot" and lie. Matching the published IP ranges is the only way to know a crawler is real. Some peer tools watch bot crawls in logs; the piece almost no one captures is the first-party AI-*referral* click — the real person who arrived from an AI answer — separated from the spoofs. That's measured ground truth, not an inference from synthetic prompts.
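A minimal sketch of that verification step, using Python's standard `ipaddress` module. The `PUBLISHED_RANGES` table and the `classify_hit` helper are hypothetical, and the CIDR blocks are reserved documentation ranges standing in for whatever lists each vendor actually publishes; this is not SourceWatch's implementation.

```python
import ipaddress

# Placeholder CIDRs from the RFC 5737 documentation blocks. They are NOT real
# vendor ranges; load the lists each vendor actually publishes instead.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def classify_hit(user_agent: str, remote_ip: str) -> str:
    """Label a request as 'verified', 'spoofed', or 'not-ai-crawler'."""
    ip = ipaddress.ip_address(remote_ip)
    ua = user_agent.lower()
    for token, cidrs in PUBLISHED_RANGES.items():
        if token.lower() in ua:
            # The claimed name only counts if the source IP is in a listed range.
            in_range = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
            return "verified" if in_range else "spoofed"
    return "not-ai-crawler"


if __name__ == "__main__":
    print(classify_hit("Mozilla/5.0; compatible; GPTBot/1.1", "192.0.2.44"))
    print(classify_hit("GPTBot/1.1", "203.0.113.9"))
```

The user-agent match only decides which range list to consult; the verdict itself comes from the IP comparison, so a request that merely claims to be GPTBot from an unlisted address is labeled spoofed.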

Moat 2 — It works inside Claude Code (MCP-native)

SourceWatch ships an MCP server, so your crawler-access and AI-visibility data is readable and actionable from inside Claude Code — your assistant can pull which bots reached you, where you're blocked, and the citation gaps that follow, then help you fix the robots.txt and the content in the same loop. Among self-serve tools that's effectively unique; the comparable agent stack is enterprise-only and gated behind a separate subscription and a paid ChatGPT plan.

What SourceWatch does NOT do (so you can choose with eyes open)

It doesn't generate content for you — it tells you exactly what to fix and where the gaps are, but you (or your team, or your assistant) do the writing. There's no public REST API yet; access today is via MCP, with a REST surface on the roadmap. The free audit covers one page; full-site checks and ongoing first-party crawler-traffic capture are on the trial. And no honest tool can promise a Knowledge Panel or guaranteed ROI.

Found a problem? How to fix your crawler access

A bad result is a short to-do list, not a verdict. The fixes are concrete and mostly in your control:

  1. **Stop blocking the citation bots.** Confirm `OAI-SearchBot`, `Claude-SearchBot` and `PerplexityBot` are not under any `Disallow: /`, including an over-broad `User-agent: *` rule. This is the single most common silent killer of AI visibility.
  2. **Set your training policy on purpose.** If you don't want to feed the models, block `GPTBot`, `ClaudeBot` and `Google-Extended` deliberately: not by accident, and not in a way that also catches the search bots.
  3. **Use current tokens.** Replace any dead `anthropic-ai` / `Claude-Web` references with `ClaudeBot`, `Claude-User` and `Claude-SearchBot` so your block actually does something.
  4. **Check the layer robots.txt can't show you.** An over-aggressive firewall, WAF or CDN rule can silently block AI-crawler IP ranges even when your robots.txt is perfect. If the bots still aren't reaching you, look there.
  5. **Then verify in your logs.** robots.txt states intent; your first-party traffic is ground truth. Watch which AI bots actually arrive (verified, not spoofed) and whether they turn into referral clicks. That's how you know the policy is working.
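One way to apply fixes 1 to 3 is to generate the AI-specific groups and test them before deploying. The sketch below is an assumption-laden example: `build_ai_rules` and `verify` are hypothetical helpers, the token lists come from this article, and the check relies on Python's `urllib.robotparser` rather than on any crawler's real behavior. It names the citation bots explicitly so that an over-broad `User-agent: *` rule elsewhere in the file no longer catches them.

```python
from urllib.robotparser import RobotFileParser

TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended"]
CITATION_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]


def build_ai_rules(block: list[str], allow: list[str]) -> str:
    """Render two robots.txt groups: one allowed sitewide, one blocked sitewide."""
    lines = [f"User-agent: {token}" for token in allow]
    lines += ["Allow: /", ""]
    lines += [f"User-agent: {token}" for token in block]
    lines += ["Disallow: /", ""]
    return "\n".join(lines)


def verify(robots_txt: str, block: list[str], allow: list[str]) -> bool:
    """True only if every 'allow' token can fetch / and no 'block' token can."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    url = "https://yoursite.com/"
    return (all(parser.can_fetch(token, url) for token in allow)
            and not any(parser.can_fetch(token, url) for token in block))


if __name__ == "__main__":
    ai_rules = build_ai_rules(TRAINING_BOTS, CITATION_BOTS)
    # Hypothetical pre-existing wildcard rule that would otherwise block everyone.
    existing = "User-agent: *\nDisallow: /\n"
    combined = ai_rules + "\n" + existing
    print(combined)
    print("Policy holds:", verify(combined, TRAINING_BOTS, CITATION_BOTS))
```

Because a named group takes precedence over the wildcard group, the citation bots stay open here even with a blanket `Disallow: /` under `User-agent: *`.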

That last step is where measurement closes the loop, and it's what SourceWatch is built for: it tells you whether ChatGPT, Perplexity, Gemini and Claude actually cite your brand — your AI visibility and share of voice against named competitors — and pairs it with the real, verified-vs-spoofed AI-crawler and AI-referral traffic landing on your site. See how it works, browse the features, or start with the free audit to see where you stand.

See whether AI crawlers can read you — and whether you're accidentally blocking the bots that make you citable. The free audit checks every current token, validates your robots.txt, and returns your top fixes.

Run the free AI crawler check

Frequently asked questions

What is an AI crawler checker?

An AI crawler checker tells you whether AI bots — from OpenAI, Anthropic, Google and Perplexity — can fetch your pages, or whether your robots.txt is blocking them. It works by reading the rules in your robots.txt for each AI user-agent token (GPTBot, OAI-SearchBot, ClaudeBot, PerplexityBot and others) and reporting which crawlers are allowed and which are disallowed. It matters because a blocked crawler can't read you, which means it can't cite you in AI answers.

How do I check if AI crawlers can access my site?

Open https://yoursite.com/robots.txt in any browser. Look for lines like "User-agent: GPTBot" or "User-agent: PerplexityBot" and check whether a "Disallow: /" follows each one — that means blocked. No matching disallow means allowed. Also check the "User-agent: *" wildcard rule, which applies to any bot without its own named rule. SourceWatch's free AI SEO audit at /ai-seo-audit does this automatically for every current token at once and tells you which blocks are intentional and which are costing you citations.

Should I block AI crawlers in robots.txt?

Almost never all of them. The modern best practice is to block the training crawlers (GPTBot, ClaudeBot, Google-Extended) so you don't feed the models for free, while allowing the search and citation crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot) so you stay visible in AI answers. A blanket block on all AI bots silently removes you from ChatGPT and Perplexity search — the visibility you were probably trying to protect.

Source: OpenAI, "Overview of OpenAI crawlers"

Does robots.txt actually stop AI crawlers?

For the major named crawlers, yes — OpenAI, Anthropic, Google and Perplexity publicly commit to honoring robots.txt. But compliance is voluntary: robots.txt states a preference, it doesn't technically block access, so anonymous scrapers can ignore it. That's why robots.txt is the right place to set policy, and your own server logs (verified against vendor IP ranges) are where you confirm what actually happened.

Source: Anthropic, "Does Anthropic crawl the web, and how can owners block it?"

I blocked anthropic-ai and Claude-Web. Am I blocking Claude?

No. Those tokens are deprecated and inactive. Current Anthropic crawling runs under ClaudeBot (training), Claude-User (user-triggered fetch) and Claude-SearchBot (search indexing). A block that only lists the old anthropic-ai or Claude-Web tokens blocks nothing current — Claude is still crawling you. This is a common mistake and exactly the kind of thing an automated check catches that eyeballing your robots.txt does not.

Source: Anthropic, "Does Anthropic crawl the web, and how can owners block it?"

Will blocking AI crawlers hurt my Google rankings?

No. Blocking AI-training tokens like Google-Extended does not affect your inclusion or ranking in Google Search — Google states this in its own documentation. Training opt-out and search visibility are separate switches. The only thing that hurts your AI visibility is accidentally blocking the search and citation crawlers (OAI-SearchBot, Claude-SearchBot, PerplexityBot).

Source: Google Search Central, "Google crawlers" (Google-Extended)

How many sites actually block AI crawlers?

Blocking has gone mainstream. As of 2025, one industry study found roughly 25% of the top 1,000 websites block GPTBot (up from about 5% in early 2023), and around 79% of top news sites block AI training bots — GPTBot is the single most-blocked AI crawler. The volume justifies the caution: as of mid-2025, Cloudflare measured GPTBot requests up about 305% year over year, with roughly 80% of AI bot activity going to training rather than citation.

Source: Cloudflare, "From Googlebot to GPTBot: who's crawling your site in 2025"

How is SourceWatch more than a robots.txt reader?

A robots.txt check tells you what you allowed; it can't tell you which bots actually showed up, or whether a hit calling itself "GPTBot" was real. SourceWatch captures the AI crawlers and AI-referral clicks actually landing on your site from your own first-party data, verified against the vendors' published IP ranges so spoofed user-agents don't count. That's measured ground truth, not intent — and it ties into whether ChatGPT, Perplexity, Gemini and Claude are really citing you. There's also an MCP server so you can pull it all into Claude Code, though there's no public REST API yet.

Enter one URL and see which AI crawlers can read you, which blocks are costing you citations, and your top fixes. No card.

Connect your first site and watch SourceWatch score your AI visibility in minutes.