First, decide what you're actually tracking
Two things get lumped together as "showing up in AI," and they're different signals you should track separately. A **brand mention** is when the AI names you in the answer text, as in "tools like Acme and Globex are popular choices." A **citation** is when the AI explicitly credits your site as a source, usually with a clickable link. You can be mentioned constantly without ever being cited, and occasionally cited in an answer that never names you in the prose.
Mentions tell you whether the model *knows and recommends* you. Citations tell you whether your *content* is trusted enough to be sourced — and they're the only one of the two that can send you traffic. Track both, on the same prompts, so you can see the gap. The deeper methodology for the citation side lives in our AI citation tracking guide; this page is about the mention side and the system that ties them together.
Only one engine reliably converts visibility to traffic
Perplexity links every source it uses, so being cited there produces trackable referral clicks: you'll see them in GA4 under Acquisition → Referral, filtered to perplexity.ai. ChatGPT and Gemini cite less consistently and pass far less referral traffic. So for Perplexity you can measure visibility *and* clicks; for the others, visibility is mostly what you've got. Plan your tracking around that asymmetry, and see how to rank in Perplexity for the engine that turns visibility into visits.
Why you can't just "check once" — the rule that breaks everything
This is the single most important thing to understand, and it's the one most tools quietly ignore: AI recommendations are wildly inconsistent. The same prompt, run again a minute later, can return a different set of brands in a different order. Treating any single answer as "the result" is like judging an election from one ballot.
SparkToro ran the experiment properly — 2,961 prompt runs, 600 volunteers, 12 prompts across ChatGPT, Claude and Google AI. The same full list of brands showed up in fewer than 1 in 100 responses. The identical *ordering* of that list appeared in roughly 1 in 1,000 runs. In other words, "rank" barely exists as a stable concept here.
97% vs 35%
In SparkToro's cancer-care test, City of Hope appeared in 69 of 71 ChatGPT responses (97%) — but was the top recommendation only 25 times (35%). Frequency of appearance is meaningful; "position" is mostly noise.
There's a second layer of variance you control even less: how people phrase the question. SparkToro found that the semantic similarity between prompts written by different users asking for the same thing (same intent, same goal) averaged just 0.081. So even one perfectly chosen prompt under-samples how your real customers actually ask. You need both many *runs* of each prompt and many *variations* of the prompt.
The practical takeaway
Run each prompt 60–100 times and report the aggregate visibility % ("we appear in 71% of runs for this query"), not a single position. Any tool selling you a confident "AI ranking position" is showing you a number that won't hold up if you re-run it. Visibility % is the honest metric.
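As a minimal sketch of that aggregation, the snippet below turns "appeared in X of N runs" into a visibility %. The Wilson score interval around it is an addition of ours, not something the SparkToro study reports; it is one common way to show how much a percentage from a limited number of runs could still move. The function name `visibility` is hypothetical.

```python
import math
from typing import Tuple


def visibility(appearances: int, runs: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (visibility %, lower %, upper %) for a brand across repeated runs.

    The bounds are a Wilson score interval (z=1.96 is roughly 95% confidence).
    This interval is an illustrative choice, not part of the cited study.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= appearances <= runs:
        raise ValueError("appearances must be between 0 and runs")
    p = appearances / runs
    denom = 1 + z ** 2 / runs
    centre = (p + z ** 2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z ** 2 / (4 * runs ** 2)) / denom
    return 100 * p, 100 * max(0.0, centre - half), 100 * min(1.0, centre + half)


# Counts from the cancer-care example: 69 appearances and 25 top spots in 71 runs.
for label, count in [("appeared", 69), ("ranked first", 25)]:
    pct, low, high = visibility(count, 71)
    print(f"{label}: {pct:.0f}% of runs (plausible range {low:.0f}% to {high:.0f}%)")
```

The width of the range is the practical argument for more runs: with only a handful of samples, the interval is too wide to tell a real change from noise.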
The four metrics worth tracking
Skip the vanity numbers. These four tell you something you can act on:
| Metric | What it answers | How to read it |
|---|---|---|
| Mention / citation frequency | How often do we appear for our target queries? | Tracked per engine — each one cites differently. Use % of runs, not a count. |
| Share of voice | How big is our slice vs. the competitor set? | (Your mentions ÷ total market mentions) × 100. This is the headline number. |
| Sentiment / positioning | Are we framed positively, neutrally, or negatively? | A positive recommendation beats a neutral list-mention. Watch the wording, not just the name. |
| Context / trigger | Which prompts surface us — and which sources did the engine use? | Tells you *why* you appeared, so you know what to reinforce. |
Share of voice is the one number to put on the wall
Raw mention counts lie. As the whole AI ecosystem grows, *everyone's* mention count drifts upward — so a rising number can mean the category got bigger, not that you got better. Share of voice fixes that by giving you a denominator: your mentions as a percentage of all brand mentions across a fixed competitor set and a fixed prompt bank. It's the AI equivalent of organic share of voice, and it's the metric that actually tracks progress over time in an AI visibility tracker.
The formula is simple — `(your brand mentions ÷ total market mentions) × 100` — but it only works if the competitor set and prompt bank stay constant between measurements. Change the inputs and you're comparing two different things.
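A small sketch of the formula, assuming you already have per-brand mention counts from a fixed prompt bank. The brand names and counts are made up for illustration, and `share_of_voice` is a hypothetical helper, not any tool's API.

```python
from collections import Counter
from typing import FrozenSet


def share_of_voice(mentions: Counter, brand: str, competitor_set: FrozenSet[str]) -> float:
    """(brand mentions / total mentions across the fixed set) x 100.

    Brands outside `competitor_set` are ignored on purpose, so the
    denominator stays comparable from one measurement to the next.
    """
    tracked = competitor_set | {brand}
    total = sum(mentions[name] for name in tracked)
    if total == 0:
        return 0.0
    return 100 * mentions[brand] / total


# Hypothetical counts from one month's runs of the same prompt bank.
COMPETITORS = frozenset({"Globex", "Initech"})
counts = Counter({"Acme": 42, "Globex": 30, "Initech": 28, "NewEntrant": 15})

print(f"Acme share of voice: {share_of_voice(counts, 'Acme', COMPETITORS):.1f}%")
```

Freezing the competitor set in code is a cheap way to enforce the "keep the inputs constant" rule: a newly noticed brand like the hypothetical `NewEntrant` does not silently change the denominator.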
A three-layer system you can actually run
Here's a credible setup an SMB or a one-person marketing team can stand up this week. Three layers, increasing in automation.
1. Build a prompt bank (the foundation)
Write down the 30–50 real questions your customers actually ask, phrased as full natural-language prompts — persona + company type + pain point + the question. "Best project management tool for a 12-person design agency that hates Jira" beats "project management software." Treating prompts like keywords is the #1 tracking mistake: generic, keyword-style prompts return brandless, informational answers, so you never see who gets recommended.
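One way to keep a prompt bank structured is to store the four ingredients separately and render them into full questions. This is only a sketch: the `PromptSpec` class and the sample wording are hypothetical, and a plain cross-product will produce some awkward combinations that you should prune by hand.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Sequence


@dataclass(frozen=True)
class PromptSpec:
    persona: str      # who is asking
    company: str      # what kind of company they work at
    pain_point: str   # the problem, as a full sentence
    question: str     # the actual ask, as a full sentence

    def render(self) -> str:
        """Join the four parts into one natural-language prompt."""
        return f"I'm {self.persona} at {self.company}. {self.pain_point} {self.question}"


def build_bank(personas: Sequence[str], companies: Sequence[str],
               pain_points: Sequence[str], questions: Sequence[str]) -> List[str]:
    """Render every combination; review the output before using it."""
    return [PromptSpec(*combo).render()
            for combo in product(personas, companies, pain_points, questions)]


bank = build_bank(
    personas=["an operations lead", "a studio founder"],
    companies=["a 12-person design agency"],
    pain_points=["We find Jira too heavy.", "Our projects keep slipping."],
    questions=["Which project management tool would you recommend?"],
)
for prompt in bank:
    print(prompt)
```

Keeping the parts separate also makes it easy to add phrasing variations later without rewriting the whole bank.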
2. Test ~10 prompts weekly, by hand
Rotate about ten prompts a week across ChatGPT, Perplexity and Gemini (add Claude and Copilot if your buyers use them). For each, log: did you appear, which competitors appeared, the sentiment, and which sources the engine cited. Run each prompt several times in a row — you'll watch the variance happen live, which is the fastest way to internalize why one-shot checking is useless.
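If a spreadsheet feels too loose, the same log can be kept as a CSV with one row per run. The sketch below assumes the fields listed above; the `ManualCheck` record and its column names are hypothetical, so rename them to suit your own sheet.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable, TextIO


@dataclass
class ManualCheck:
    checked_on: str    # ISO date of the test, e.g. "2025-01-06"
    engine: str        # ChatGPT, Perplexity, Gemini, ...
    prompt: str        # the full prompt, exactly as typed
    appeared: bool     # did our brand show up in the answer?
    competitors: str   # semicolon-separated competitor names
    sentiment: str     # positive / neutral / negative
    sources: str       # semicolon-separated cited URLs or domains


def write_log(rows: Iterable[ManualCheck], fh: TextIO) -> None:
    """Write one CSV row per manual run, with a header row first."""
    writer = csv.DictWriter(fh, fieldnames=[f.name for f in fields(ManualCheck)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


# Illustrative entry only; the values are invented.
example = ManualCheck("2025-01-06", "Perplexity",
                      "Best project management tool for a 12-person design agency that hates Jira",
                      True, "Globex;Initech", "positive", "example.com")
buffer = io.StringIO()
write_log([example], buffer)
print(buffer.getvalue())
```

Logging every repeat as its own row, rather than a summary, is what lets you compute a visibility % later instead of a single yes or no.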
3. Automate monthly tracking at scale
Manual testing keeps you honest but can't do 60–100 runs per prompt across five engines. A dedicated tool runs your prompt bank repeatedly and averages the results to cancel out the probabilistic noise — turning "it depends" into a stable visibility % and share-of-voice trend you can report on.
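Conceptually, the automated layer is a loop over engines, prompts and repeats. The sketch below shows that shape only: `ask` is a hypothetical callable you would back with each engine's real API, `fake_engine` is a random stand-in so the example runs offline, and the substring match is a deliberately naive way to detect a mention.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Sequence, Tuple


def run_bank(ask: Callable[[str, str], str], engines: Sequence[str],
             prompts: Sequence[str], brand: str, runs: int = 60) -> Dict[Tuple[str, str], float]:
    """Return visibility % per (engine, prompt) after `runs` repeats of each."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    for engine in engines:
        for prompt in prompts:
            for _ in range(runs):
                answer = ask(engine, prompt)
                # Naive check: case-insensitive substring match on the brand name.
                if brand.lower() in answer.lower():
                    hits[(engine, prompt)] += 1
    return {(e, p): 100 * hits[(e, p)] / runs for e in engines for p in prompts}


def fake_engine(engine: str, prompt: str, rng: random.Random = random.Random(0)) -> str:
    """Stand-in for a real API call: names two of four made-up brands at random."""
    named = rng.sample(["Acme", "Globex", "Initech", "Umbrella"], k=2)
    return f"Popular choices include {named[0]} and {named[1]}."


results = run_bank(fake_engine, ["engine-a", "engine-b"],
                   ["best tool for a small design agency"], brand="Acme", runs=60)
for (engine, prompt), pct in results.items():
    print(f"{engine}: Acme appears in {pct:.0f}% of runs")
```

A real setup would also need rate limiting, retries and stored raw answers, which are left out here to keep the loop visible.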
The manual layer is your reality check; the automated layer is your scale. Run both — the weekly hands-on testing is what stops you from blindly trusting a dashboard, and the monthly automation is what defeats the noise a human sampler never could.
Want a 15-second starting point? A free AI SEO audit checks whether the engines can even read and recognize your brand — the precondition for ever being mentioned.
Layer in the first-party signals (this is your ground truth)
Prompt testing — manual or automated — is still a sample. The signals below come straight from Google and from your own server, so they're not a synthetic estimate. They're what actually happened. Layer them on top.
Google Search Console — the new AI performance reports
As of June 2026, Search Console has dedicated generative-AI performance reports showing your impressions in AI Overviews and AI Mode (and Discover) — impressions, pages, countries, devices and dates. It's the first first-party way to see your visibility inside Google's AI features. Two caveats: there's no click data yet, and it's rolling out to a subset of sites first, so you may not have it the day you look.
GA4 — referral traffic from the engines
Filter Acquisition → Referral for chatgpt.com, perplexity.ai and gemini.google.com to capture the AI visitors who actually clicked through. Perplexity will dominate here because it links every source; the others trickle. It's a small stream, but it's real people, and ChatGPT-referred visitors have been measured converting well above organic search visitors.
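If you export referrer URLs (from GA4 or from your own logs), the same filter can be applied in a few lines. This is a sketch under the assumption that you have full referrer URLs as strings; the three domains come from the paragraph above, and `classify_referrer` is a hypothetical helper, not a GA4 API.

```python
from typing import Optional
from urllib.parse import urlparse

# Domains named in the text; extend the mapping if your buyers use other engines.
AI_REFERRERS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
}


def classify_referrer(referrer: str) -> Optional[str]:
    """Map a full referrer URL to an AI engine name, or None if it is not one."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, engine in AI_REFERRERS.items():
        # Match the domain itself or any subdomain, but not look-alike hosts.
        if host == domain or host.endswith("." + domain):
            return engine
    return None


for url in ["https://www.perplexity.ai/search?q=crm",
            "https://chatgpt.com/",
            "https://www.google.com/search?q=crm"]:
    print(url, "->", classify_referrer(url))
```

Matching on the parsed hostname rather than on a raw substring avoids counting unrelated sites whose URLs merely contain one of these names.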
Server logs — are the crawlers even reaching you?
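Watch your logs (or your AI-crawler monitoring) for OAI-SearchBot (ChatGPT) and PerplexityBot. Google-Extended works differently: it is a robots.txt control token rather than a separate user-agent string, so Google's fetches show up under its regular crawler user agents and Google-Extended is something you check in robots.txt, not in logs. This is the most upstream check there is: if the crawlers can't reach and read your pages, you can't be cited, full stop. One important wrinkle: bot user-agents are easy to spoof, so a log line claiming to be "GPTBot" isn't proof it really was. Verified crawler traffic (matched to the official IP ranges) is the signal that counts.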
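The verified-versus-spoofed check can be sketched with the standard library, assuming access logs in the common "combined" format where the user agent is the last quoted field. The IP ranges below are placeholders taken from reserved documentation blocks, not real crawler ranges; you would replace them with the lists each vendor publishes. `classify_hit` is a hypothetical helper.

```python
import ipaddress
import re
from typing import Optional, Tuple

# PLACEHOLDER ranges (reserved documentation blocks), NOT the vendors' real ranges.
OFFICIAL_RANGES = {
    "OAI-SearchBot": ["192.0.2.0/24"],
    "PerplexityBot": ["198.51.100.0/24"],
}

# Client IP is the first field; the user agent is the last quoted field on the line.
LOG_LINE = re.compile(r'^(?P<ip>\S+) .*"(?P<ua>[^"]*)"$')


def classify_hit(line: str) -> Optional[Tuple[str, str]]:
    """Return (bot name, 'verified' or 'spoofed'), or None for non-bot traffic."""
    match = LOG_LINE.match(line.strip())
    if not match:
        return None
    user_agent = match["ua"].lower()
    for bot, cidrs in OFFICIAL_RANGES.items():
        if bot.lower() not in user_agent:
            continue
        try:
            ip = ipaddress.ip_address(match["ip"])
        except ValueError:
            return bot, "spoofed"  # first field is not a parseable IP address
        verified = any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
        return bot, "verified" if verified else "spoofed"
    return None


sample = ('198.51.100.7 - - [06/Jan/2025:10:00:00 +0000] "GET /pricing HTTP/1.1" '
          '200 2326 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"')
print(classify_hit(sample))
```

Counting verified and spoofed hits separately keeps impersonators from inflating your picture of how often the real crawlers visit.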
Why first-party beats any tool's sample
A prompt-testing tool tells you what an engine *probably* says. Your server logs and GA4 tell you what an engine *actually did* — which pages it crawled and which clicks it sent. SourceWatch captures that first-party side: real, verified-vs-spoofed AI-crawler and AI-referral traffic hitting your site, alongside your mention and share-of-voice tracking across ChatGPT, Perplexity, Gemini and Claude. For teams in Claude Code, SourceWatch also ships an MCP server so you can pull that data straight into your workflow.
Once you're tracking, here's what moves the number
Tracking without levers is just watching. The peer-reviewed GEO study (Princeton / Georgia Tech / IIT-Delhi, KDD 2024) tested what actually lifts a page's pull into AI answers — and the winners are about *enrichment*, not keywords. These are the core moves behind generative engine optimization:
- **Add quotations** — up to +41% visibility. Clean, liftable statements the model can quote directly.
- **Add statistics** — roughly +33–41%. Specific numbers read as authoritative.
- **Add fluency / clearer writing** — about +29%. Readable, well-structured prose gets pulled more.
- **Cite your own sources** — about +28%. Pages that reference credible sources are treated as more credible.
- **Keyword stuffing — about −8%.** The old SEO trick actively *hurts* in generative engines. Don't.
And per Google's own guidance, the foundation is unique first-hand perspective and people-first content — structured data, llms.txt and "chunking" are *not* required to appear. Track first, find the prompts where competitors get named and you don't, then apply these levers to the pages that should be winning them.
See exactly how often ChatGPT, Perplexity, Gemini and Claude mention and cite you — visibility %, share of voice, and verified AI-crawler traffic in one place.
Common mistakes
Almost every bad AI-tracking setup makes one of these errors:
- **Treating prompts like keywords.** Generic prompts return brandless answers. Write full, natural questions a real buyer would ask.
- **Sampling once.** AI is probabilistic — one answer is one ballot. Run each prompt 60–100 times and report a visibility %.
- **Tracking raw "total mentions" with no denominator.** The number rises as the whole AI ecosystem grows, not because you improved. Always use share of voice.
- **Tracking only one engine.** ChatGPT and Perplexity overlap on ~11% of sources. Winning one says nothing about the rest.
- **Ignoring stale entity data.** Models often describe brands from outdated training data. Run an entity audit so you're not tracking a wrong description of yourself — start with your AI visibility baseline.
- **Treating it as a one-time audit.** 40–60% of cited sources churn monthly. Tracking is a continuous loop, not a screenshot.
- **Trusting "AI rank position" tools.** Visibility % is meaningful; a confident positional "rank" is invented. If a tool can't reproduce the number on a re-run, it isn't real.