Training Access
Whether AI training crawlers like GPTBot, ClaudeBot, and PerplexityBot can read your site. Free CORE dimension — included on every audit, no signup needed.
What Training Access measures
Training Access tests whether your robots.txt allows AI training crawlers to read your content. These crawlers fetch your pages over time to feed AI knowledge models — if they're blocked, AI eventually forgets you.
It's the most asymmetric CORE dimension: a single line in robots.txt can drop your score by 60+ points. The good news is the fix is also one line.
The 10 AI training crawlers
Training Access checks robots.txt allow/deny status for each of these user-agents. Each blocked crawler counts against the score.
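- GPTBot (OpenAI) — crawls pages that may be used to train OpenAI's models, including those behind ChatGPT. Live search and user-initiated fetches use OpenAI's separate OAI-SearchBot and ChatGPT-User agents. Highest-impact bot to allow.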
- ClaudeBot (Anthropic) — the Anthropic training crawler, distinct from the Claude chat product. Used to build Claude's world model.
- PerplexityBot (Perplexity) — Perplexity blends live search with offline indexation; this bot fetches the offline half.
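- Google-Extended — Google's AI training opt-out signal. It is a robots.txt control token rather than a separate crawler: Gemini training and grounding honour it, while Google Search features such as AI Overviews follow your Googlebot rules instead.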
- CCBot (Common Crawl) — the open dataset that seeds most LLM training. Blocking CCBot effectively removes you from the open AI training corpus.
- Bytespider (ByteDance / TikTok) — trains TikTok's recommendation and AI models. Important if your audience overlaps with TikTok demographics.
- Applebot (Apple) — powers Apple Intelligence and Siri's web grounding.
- Amazonbot (Amazon) — feeds Alexa's knowledge graph and Amazon's product-AI assistants like Rufus.
- FacebookBot / Meta-ExternalAgent — Meta AI training (Llama, Meta AI assistant in Instagram/WhatsApp).
- Cohere-AI — Cohere's enterprise LLM training crawler.
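The per-crawler check described above can be approximated locally. This is a minimal sketch using Python's standard urllib.robotparser, not the audit's actual implementation; the token list is copied from the list above (with FacebookBot and Meta-ExternalAgent treated as two tokens, which is an assumption), and the function name and SAMPLE text are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Tokens copied from the list above. Splitting FacebookBot and
# Meta-ExternalAgent into two entries is an assumption of this sketch.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot",
    "Bytespider", "Applebot", "Amazonbot", "FacebookBot",
    "Meta-ExternalAgent", "Cohere-AI",
]


def crawler_access(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Map each crawler token to True if robots_txt lets it fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, path) for bot in AI_CRAWLERS}


# Hypothetical robots.txt: one bot blocked by name, everyone else allowed.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for bot, allowed in crawler_access(SAMPLE).items():
        print(f"{bot:20} {'allowed' if allowed else 'BLOCKED'}")
```

Python's parser matches user-agent tokens case-insensitively and falls back to the `*` group when a bot is not named, so a wildcard Disallow: / blocks every token in the list at once.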
Common issues
These are the most common reasons Training Access fails.
- Default-deny robots.txt — Some hosting platforms (Vercel previews, staging environments) ship a robots.txt with Disallow: / for all bots. If that file ever leaks to production, every AI crawler is blocked.
- Cookie / consent walls — Cookie walls hit Training Access especially hard because crawlers can't accept cookies. Render meaningful content above the wall, or detect crawler user-agents and serve uncloaked HTML.
- Cloudflare bot-fight mode — "Super Bot Fight Mode" or aggressive WAF rules can block AI crawlers even when robots.txt allows them. Verify by trying to fetch your homepage with curl -A "GPTBot".
- IP-based blocking — Some sites block whole AS numbers (cloud provider ranges) at the firewall level. AI crawlers often run from cloud IPs, so even a permissive robots.txt won't help if the network drops their packets.
- No robots.txt at all — While "no robots.txt" technically allows everything, some crawlers treat its absence as a yellow flag. Always ship an explicit robots.txt with Allow: / rules for each AI bot you want to opt in.
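Several of the issues above only show up at the HTTP layer, where robots.txt has no say. The following is a rough Python equivalent of the curl check, offered as a sketch: the function names and the status buckets are assumptions of this example, not part of any audit tooling. Note that sending a crawler's name as the User-Agent only imitates the bot, so firewalls that verify source IPs may treat this request differently from the real crawler.

```python
import urllib.error
import urllib.request


def classify_status(status: int) -> str:
    """Bucket an HTTP status; only a plain 200 counts as reachable."""
    if status == 200:
        return "ok"
    if 300 <= status < 400:
        return "redirected"  # e.g. a bounce to a captcha or consent page
    if status in (401, 403, 429, 503):
        return "blocked"     # responses commonly returned by bot protection
    return "error"


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # do not follow; the 3xx then surfaces as an HTTPError


def probe(url: str, user_agent: str = "GPTBot", timeout: float = 10.0) -> str:
    """Send a HEAD request with the given User-Agent and classify the result."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as exc:
        return classify_status(exc.code)
    except urllib.error.URLError:
        return "unreachable"  # DNS failure, refused connection, timeout


# Usage (performs a real network request):
#   print(probe("https://yoursite.com/", "GPTBot"))
```

Comparing the result for a crawler User-Agent against the result for a normal browser User-Agent is one way to tell a bot-specific block from a site-wide outage.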
How to improve your Training Access score
A perfect Training Access score takes about 5 minutes to ship. Here's the order of operations.
- Add explicit allow blocks for each AI training bot. In your robots.txt, add a section like:

      User-agent: GPTBot
      Allow: /

      User-agent: ClaudeBot
      Allow: /

      User-agent: PerplexityBot
      Allow: /

      User-agent: Google-Extended
      Allow: /

  Repeat for each bot you want to opt in.
- Render meaningful content above the cookie wall. If you must show a consent banner, make sure the page's H1, opening paragraph, and key facts are present in the initial HTML before the banner overlays. Crawlers stop at the wall.
- Whitelist crawler IPs in your WAF. Cloudflare, Fastly, and AWS WAF all support known-bot allowlists. Enable the AI-bot whitelist or manually allow OpenAI, Anthropic, Perplexity, and Google IP ranges.
- Verify with curl. After deploying, test with curl -I -A "GPTBot" https://yoursite.com/. Look for a 200 status. Anything else (403, 503, redirect to a captcha) means the bot is still blocked.
- Re-run the free access check. Make sure the new robots.txt is actually being served at the edge first: purge your CDN cache, or allow up to 24 hours for a cached copy on Cloudflare to expire. Once the new file is live, Training Access updates within minutes.
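The first step above can be scripted so the allow blocks are generated rather than typed by hand. This is a sketch under stated assumptions: allow_blocks and merge_into are hypothetical helper names, and the merge only detects bots by their exact User-agent line, so review the output before deploying it.

```python
from urllib.robotparser import RobotFileParser


def allow_blocks(bots: list[str]) -> str:
    """Render one 'User-agent / Allow: /' group per bot, blank-line separated."""
    groups = [f"User-agent: {bot}\nAllow: /" for bot in bots]
    return "\n\n".join(groups) + "\n"


def merge_into(existing: str, bots: list[str]) -> str:
    """Append allow groups for bots the existing robots.txt does not name."""
    named = {
        line.split(":", 1)[1].strip().lower()
        for line in existing.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }
    missing = [bot for bot in bots if bot.lower() not in named]
    if not missing:
        return existing  # every bot already has its own group
    prefix = existing.rstrip("\n")
    separator = "\n\n" if prefix else ""
    return prefix + separator + allow_blocks(missing)


if __name__ == "__main__":
    # Hypothetical starting point: a default-deny file that leaked to production.
    merged = merge_into("User-agent: *\nDisallow: /\n", ["GPTBot", "ClaudeBot"])
    print(merged)

    # Sanity check with the standard-library parser before deploying.
    parser = RobotFileParser()
    parser.parse(merged.splitlines())
    print("GPTBot allowed:", parser.can_fetch("GPTBot", "/"))
```

Because a group that names a bot takes precedence over the `*` group for that bot, the generated blocks open the site to the listed crawlers while leaving any existing wildcard rule in force for everything else.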
Why Training Access and Agent Access can disagree
It's normal to have a high Training Access score and a low Agent Access score (or vice versa) — they test very different things.
- Different rules of engagement. Training crawlers respect robots.txt and other policy signals; live AI agents fetch on behalf of a real user and may not apply those rules in the same way. So a policy fix that opens you up to training crawlers may have no effect on Agent Access, and vice versa.
- Different timescales of impact. Training Access changes percolate slowly — your fix today shows up in model behaviour weeks or months from now, after the next training refresh. Agent Access changes are felt immediately on the next real-time query against your site.
- Different failure modes. Training Access fails on things training pipelines can't tolerate (broad bot-deny rules, cookie/consent walls, IP-level restrictions). Agent Access fails on things real-time fetches can't tolerate (slow responses, aggressive bot challenges, broken redirect chains, content gated behind modals).
Frequently asked questions
- Is Training Access really free? What's the catch?
  Yes, no catch. Training Access plus Agent Access make up the free access check — we run them on any URL with no signup. The 8 paid dimensions (the 3 deeper CORE dims and the 5 AURA dims) stay locked in the report until you spend a $29 audit credit. We give the bot-allowance check away because it's the single most actionable insight and it's also the most-asked question we get.
- Should I block ClaudeBot but allow GPTBot?
  No, that's almost always counter-productive. The bots feed different AI products, and most users today consult multiple AI assistants. Blocking one bot just removes you from one product's training data — it doesn't protect any "exclusivity" because none of the AI products pay for content based on bot allowance. The exception is if your legal team has a specific compliance reason to block a vendor.
- What's the difference between ClaudeBot and Claude (the chat product)?
  ClaudeBot is the Anthropic training crawler: it collects web content that feeds Claude's model. Claude (the chat product) is what users interact with. They're related but separate. Blocking ClaudeBot affects what Claude knows long-term, but Claude may still fetch your live site when a user asks it to, because those fetches go through a separate user agent with its own rules and a ClaudeBot rule does not cover them.
- How quickly does Training Access reflect a robots.txt change?
  AIVerdict re-fetches robots.txt on every audit (no caching). So as soon as your new robots.txt is live at the edge (~5 minutes after deploy on most CDNs), the next free access check will reflect it. Note that the bots themselves may take days to re-crawl, and AI Brand Recognition lags Training Access by weeks to months.
Run a free Training Access check
No signup. Get your Training Access + Agent Access score in 10–30 seconds, then unlock the full 10-dimension audit for $29 when you're ready.