Glossary

What are AI Crawlers?

An AI crawler is an automated bot, run by an AI company, that fetches your web pages for one of three distinct jobs: to **train** a model on your content, to **index** you so the AI can cite you in answers, or to **fetch a page live** when a person asks an assistant about it. The one thing most people get wrong: those three jobs are now controlled separately. You can let an engine cite you in ChatGPT, Perplexity and Claude answers while still refusing to feed its training. You tell the bots apart by the user-agent name each one announces in its requests, and you allow or block each name in your robots.txt.

TL;DR

**An AI crawler is a bot that fetches your pages for an AI engine** — to train a model, index you for citations, or fetch a page live for a user.
Those three jobs have **three separate controls.** You can block training and still be cited in AI answers.
**The expensive trap: blocking all AI bots makes you uncitable.** Block the *index/search* bots and you vanish from ChatGPT, Perplexity and Claude answers.
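**The smart default: block training, keep search/index/user-fetch open.** You feed no model and stay fully citable. It does *not* hurt search visibility: Google says Google-Extended is not a ranking signal, and Apple says pages that disallow Applebot-Extended can still be included in search results.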
robots.txt is the switch — but **user-triggered fetchers (ChatGPT-User, Perplexity-User) can ignore it**, and bad actors like Bytespider often do. **Verify bots by published IP range, never the user-agent string.**

AI crawlers, defined

An AI crawler is an automated program operated by an AI company — OpenAI, Anthropic, Google, Perplexity, Apple and others — that visits your site and reads your pages on the engine's behalf. It works like a traditional search crawler, but it exists to power an AI product instead of a list of blue links. And unlike the old world, where Googlebot did essentially one job, today's AI crawlers split into three jobs you control independently.

1. **Training crawlers.** They harvest text to train or fine-tune the foundation model itself: GPTBot (OpenAI), ClaudeBot (Anthropic), Bytespider (ByteDance/TikTok). Blocking these keeps your content out of the model's training set.
2. **Search / index crawlers.** They build the index an engine searches when it writes an answer: OAI-SearchBot (OpenAI), Claude-SearchBot (Anthropic), PerplexityBot (Perplexity). These are the bots that make you citable. Block them and you disappear from AI answers.
3. **User-fetch bots.** They fetch one page live, the moment a person pastes a link or asks the assistant about a specific URL: ChatGPT-User, Perplexity-User, Claude-User. They act on a human's request, in real time.

Two names on this list aren't crawlers at all

Google-Extended and Applebot-Extended look like bots but make no requests and never appear in your logs. They are control tokens — robots.txt switches that govern whether content Google or Apple already crawled may be used for AI training and grounding. Treat them as directives, not visitors.

How you control them: robots.txt and its limits

The control layer is robots.txt, the plain-text file at the root of your domain. Each AI crawler announces a specific user-agent name, and you allow or disallow each one by that name. This is what makes granular control possible: you can disallow GPTBot (training) and allow OAI-SearchBot (citations) in the very same file. The documented training and index bots from OpenAI, Anthropic and Perplexity honor these rules.

But robots.txt is a request, not a firewall — and two things break the simple mental model. First, **user-triggered fetchers can ignore it by design.** Because a human asked for the page in real time, OpenAI's ChatGPT-User and Perplexity's Perplexity-User do not always honor robots.txt. Second, **bad actors ignore it entirely** — Bytespider (ByteDance/TikTok) is the most-cited offender and publishes no official documentation or IP ranges, making it the hardest crawler to control or even verify.

Verify by IP, but don't block by IP

A user-agent name is just text, and trivially spoofed — a scraper can call itself "GPTBot" and lie. The only reliable way to confirm a bot is real is to check its IP against the operator's published ranges (OpenAI ships gptbot.json, searchbot.json and chatgpt-user.json; Perplexity and Anthropic publish their own). And don't block by IP: Anthropic warns that IP blocking "may not work correctly or persistently guarantee an opt-out, as doing so impedes our ability to read your robots.txt file." Block by user-agent, verify by IP.
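
The IP check described above can be sketched in Python with the standard-library `ipaddress` module. This is a minimal sketch, not an official client. The JSON layout (a `prefixes` list with `ipv4Prefix` and `ipv6Prefix` keys) is an assumption about how a published range file might be structured, and the sample networks are reserved documentation ranges, not real crawler IPs. Check the operator's actual file before relying on it.

```python
import ipaddress
import json

# Hypothetical sample shaped like a published range file. The key names
# ("prefixes", "ipv4Prefix", "ipv6Prefix") are an assumption, and the
# networks are reserved documentation ranges, not real crawler IPs.
SAMPLE_RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv6Prefix": "2001:db8::/32"}
  ]
}
"""


def load_networks(ranges_json: str) -> list:
    """Parse a range file into a list of ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_verified(remote_ip: str, networks: list) -> bool:
    """Return True if remote_ip falls inside any of the given networks."""
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
print(is_verified("192.0.2.44", networks))    # inside the sample range
print(is_verified("198.51.100.7", networks))  # outside: treat as unverified
```

In practice you would load the operator's real file in place of `SAMPLE_RANGES_JSON` and run the check only on requests whose user-agent claims to be that operator's bot.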

The AI crawlers, one by one

Here are the bots that matter, what each one does, and what you give up by blocking it. The two starred rows are directives, not crawlers — flagged accordingly.

| Bot (user-agent) | Operator | Job | Obeys robots.txt? | Blocking it means… |
| --- | --- | --- | --- | --- |
| GPTBot | OpenAI | Trains GPT models | Yes | Not used for GPT training |
| OAI-SearchBot | OpenAI | Indexes for ChatGPT Search citations | Yes | You vanish from ChatGPT search answers |
| ChatGPT-User | OpenAI | Live user-requested page fetch | No (user-initiated) | Users can't pull your page into a chat |
| ClaudeBot | Anthropic | Trains Claude models | Yes | Not used for Claude training |
| Claude-SearchBot | Anthropic | Index / search quality | Yes | Reduced visibility in Claude answers |
| Claude-User | Anthropic | Live user-requested fetch | Yes | Users can't pull your page into Claude |
| PerplexityBot | Perplexity | Indexes to surface + link sites | Yes | You won't be linked or cited in Perplexity |
| Perplexity-User | Perplexity | Live user-requested fetch | No (user-initiated) | Users can't pull your page into Perplexity |
| Google-Extended * | Google | Directive: Gemini training + grounding | Token only | No Gemini training; Google Search unaffected |
| Applebot-Extended * | Apple | Directive: opt out of Apple AI training | Token only | No Apple AI training; Search/Siri unaffected |
| Bytespider | ByteDance / TikTok | Trains ByteDance models | Often ignored | Attempts to block training (hard to enforce) |

* Google-Extended and Applebot-Extended are control tokens, not crawlers — they make no HTTP requests and never appear as a fetching user-agent in your logs. Note too that anthropic-ai is Anthropic's legacy training token: include it alongside ClaudeBot if you want to cover older references.

Should you block AI crawlers? (Almost never all of them)

The instinct to "block all the AI bots to protect my content" is the most expensive mistake in this whole topic. Blocking the **search/index** bots — OAI-SearchBot, Claude-SearchBot, PerplexityBot — doesn't protect you. It makes you invisible. OpenAI confirms that sites which opt out of OAI-SearchBot won't appear in ChatGPT search answers. Your content can't be retrieved, so it can't be cited, so you lose the AI visibility you're trying to build.

What most businesses actually want is granular: **block training, keep search and citations open.** You stop feeding the models for free, but you stay fully citable in AI answers. And the common fear that this tanks your SEO is simply false — Google and Apple say so in their own documentation.

Google-Extended does not impact a site's inclusion in Google Search nor is it used as a ranking signal in Google Search.
— Google Search Central documentation

Apple is just as clear: pages that disallow Applebot-Extended "can still be included in search results." Blocking AI training and staying visible in search aren't a trade-off — they're separate switches.

The "stay citable" robots.txt default

Block the training tokens (GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, Bytespider). Do NOT disallow the search/citation bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot). Result: you feed no model, you keep your Google rankings, and you stay fully citable across ChatGPT, Perplexity and Claude.
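
The default above can be written out and sanity-checked with Python's standard-library `urllib.robotparser`. This is a sketch under stated assumptions: the robots.txt body is one possible way to express the policy, `blocked_bots` is a hypothetical helper, and `https://example.com/` is a placeholder URL. Python's parser uses simplified name matching, so treat the result as a quick local check rather than proof of how each crawler will interpret the file.

```python
from urllib.robotparser import RobotFileParser

# The "stay citable" policy from the text, written as a robots.txt body.
# Bots with no matching group (Googlebot, for example) stay allowed.
ROBOTS_TXT = """\
# Training crawlers and control tokens: blocked
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Bytespider
Disallow: /

# Search / index crawlers: explicitly allowed
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
Allow: /
"""

CITATION_BOTS = ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"]
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "anthropic-ai", "Bytespider"]


def blocked_bots(robots_txt: str, bots: list,
                 url: str = "https://example.com/") -> list:
    """Return the bots from `bots` that this robots.txt disallows for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


# Citation bots should all be allowed; training bots should all be blocked.
print("Blocked citation bots:", blocked_bots(ROBOTS_TXT, CITATION_BOTS))
print("Blocked training bots:", blocked_bots(ROBOTS_TXT, TRAINING_BOTS))

# A blanket rule is the accidental case worth catching in an audit.
BLANKET = "User-agent: *\nDisallow: /\n"
print("Blanket rule blocks:", blocked_bots(BLANKET, CITATION_BOTS))
```

If the first list is ever non-empty for your live robots.txt, a search/index bot is being turned away and the site may drop out of that engine's answers.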

Why AI crawlers are now the bigger traffic story

This is no longer a niche concern. AI crawlers have quietly become a major share of non-human traffic on the web, and on some sites they now out-request the search bots everyone built their SEO around.

OpenAI's ChatGPT-User alone made 3.6x more requests than Googlebot — across 24.4M requests on 78,000+ pages over 55 days (Alli AI, via Search Engine Journal, 2026).

That shift is exactly why AI-crawler access is now a first-order AI SEO decision, not a footnote. If the bots that build the answer engines can't read you, you don't exist in the answers. And if you optimize for them properly, the upside is measurable: the peer-reviewed GEO paper found that generative-engine optimization methods — adding citations, statistics and quotable structure — can lift a source's visibility in AI answers by up to 40%.

One emerging convention worth knowing: llms.txt, a proposed standard that hands AI engines a curated Markdown map of your most important pages, adopted by Anthropic, Cloudflare and Vercel. It's no guarantee, but it's low-effort and signals the same intent as opening the door to the right crawlers.

How to get your AI-crawler setup right

1. **Audit your robots.txt first.** Confirm you're not accidentally disallowing OAI-SearchBot, Claude-SearchBot or PerplexityBot. That's the silent killer of AI visibility.
2. **Decide your training policy.** Most sites block training (GPTBot, ClaudeBot, anthropic-ai, Google-Extended, Applebot-Extended, Bytespider) while keeping search and citations open.
3. **Keep the citation bots open.** Allow the search/index and user-fetch bots so engines can retrieve, cite and link you.
4. **Verify by IP, not user-agent.** When you see "GPTBot" in your logs, check it against OpenAI's published IP ranges before trusting it. Don't block by IP; block by name.
5. **Watch what the crawlers actually do.** Your server logs are ground truth: which AI bots hit you, how often, and which answers send real referral clicks back. That first-party signal tells you whether your access policy is working.
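
Step 5 can be started with a few lines of Python over your access log. This is a rough sketch: it assumes a combined-style log where the user-agent is the last quoted field, `count_ai_hits` is a hypothetical helper, and the sample lines (including their simplified user-agent strings and documentation IPs) are invented for illustration. Because user-agents can be spoofed, treat the counts as unverified until the source IPs are checked against published ranges.

```python
import re
from collections import Counter

# User-agent tokens from the article, grouped by the job each one does.
BOT_JOBS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "OAI-SearchBot": "search/index",
    "Claude-SearchBot": "search/index",
    "PerplexityBot": "search/index",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "Perplexity-User": "user-fetch",
}

# Assumption: the user-agent is the last double-quoted field on the line.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines) -> Counter:
    """Count requests per (bot token, job) based on the claimed user-agent."""
    hits = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for token, job in BOT_JOBS.items():
            if token.lower() in user_agent:
                hits[(token, job)] += 1
                break
    return hits


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.11 - - [01/Jan/2026:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.12 - - [01/Jan/2026:00:00:03 +0000] "GET /blog HTTP/1.1" 200 700 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '192.0.2.10 - - [01/Jan/2026:00:00:04 +0000] "GET /docs HTTP/1.1" 200 640 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '198.51.100.5 - - [01/Jan/2026:00:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]

for (token, job), count in sorted(count_ai_hits(SAMPLE_LOG).items()):
    print(f"{token} ({job}): {count}")
```

Grouping by job rather than by bot name makes it easier to see at a glance whether the search/index and user-fetch bots are actually reaching you.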

Not sure whether AI crawlers can read you — or whether you're accidentally blocking the bots that make you citable? Run a free, one-page audit. It checks AI-crawler access, entity recognition and answer-readiness in about 15 seconds.

Step 5 is where measurement closes the loop. SourceWatch tracks whether ChatGPT, Perplexity, Gemini and Claude actually cite your brand (your AI visibility and share of voice against named competitors) and pairs it with the first-party AI-crawler and AI-referral traffic landing on your site. So you can see, not guess, whether the right bots are reading you and turning into citations. There's also an MCP server for Claude Code, so you can check it without leaving your editor.

Common misconceptions

**"Block all AI bots to protect my content."** Blocking the search/index bots makes you uncitable in ChatGPT, Perplexity and Claude. The right move is granular: block training, keep search and citations open.
**"Google-Extended and Applebot-Extended are crawlers I can see in my logs."** They're control tokens, not bots — they make no requests and never appear as a fetching user-agent.
**"Blocking AI training will tank my SEO."** False, per Google's and Apple's own docs. Search inclusion and rankings are unaffected.
**"robots.txt stops everything."** User-triggered fetchers (ChatGPT-User, Perplexity-User) can bypass it by design, and bad actors like Bytespider may ignore it entirely. It's a request, not a firewall.
**"I'll match the user-agent string to verify a bot."** User-agents are trivially spoofed. Only the operator's published IP ranges actually verify a crawler.

Frequently asked questions

What is an AI crawler?

An AI crawler is an automated bot run by an AI company that fetches your web pages for one of three jobs: to train a model on your content (e.g. GPTBot, ClaudeBot), to index your content so the AI can cite it in answers (e.g. OAI-SearchBot, PerplexityBot), or to fetch a page live when a person asks the assistant about it (e.g. ChatGPT-User). These three jobs are controlled separately in your robots.txt.

Should I block AI crawlers?

Almost never all of them. Blocking the search/index bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) makes you uncitable — you disappear from ChatGPT, Perplexity and Claude answers. The setup most businesses want is granular: block the training bots so you don't feed the models for free, but keep the search/citation bots open so you stay visible in AI answers.

Does blocking AI crawlers hurt my Google rankings?

No. Blocking AI-training tokens does not affect Google Search. Google states verbatim that "Google-Extended does not impact a site's inclusion in Google Search nor is it used as a ranking signal," and Apple says pages disallowing Applebot-Extended "can still be included in search results." Search inclusion and AI-training opt-out are separate switches.

Source: Google Search Central — Google's common crawlers

Are Google-Extended and Applebot-Extended crawlers?

No. They are robots.txt control tokens, not bots. They make no HTTP requests and never appear in your server logs as a fetching user-agent. They only govern whether content Google or Apple already crawled may be used for AI training and grounding.

Source: Apple Support — Applebot and model training

Will robots.txt stop every AI crawler?

No. The training and index bots honor robots.txt, but user-triggered fetchers like ChatGPT-User and Perplexity-User can ignore it by design, because a human requested the page in real time. Some bad actors, such as Bytespider, may ignore robots.txt entirely. robots.txt is a polite request, not a firewall.

Source: OpenAI — Overview of OpenAI crawlers

How do I verify an AI crawler is real and not a spoof?

Check its IP address against the operator's published ranges — not the user-agent string, which anyone can fake. OpenAI publishes gptbot.json, searchbot.json and chatgpt-user.json; Perplexity and Anthropic publish their own crawler IP lists. Don't block by IP, though — Anthropic warns that IP blocking can stop them reading your robots.txt and won't reliably guarantee an opt-out. Block by user-agent, verify by IP.

Source: Anthropic — Does Anthropic crawl the web, and how can owners block it?
