CORE Dimension · Free

Training Access

Whether AI training crawlers like GPTBot, ClaudeBot, and PerplexityBot can read your site. Free CORE dimension — included on every audit, no signup needed.

Overview

What Training Access measures

Training Access tests whether your robots.txt allows AI training crawlers to read your content. These crawlers fetch your pages over time to build the datasets AI models are trained on. If they're blocked, your content drops out of future training runs and what the models know about you goes stale.

It's the most asymmetric CORE dimension: a single line in robots.txt can drop your score by 60+ points. The good news is the fix is also one line.

What We Check

The 10 AI training crawlers

Training Access checks robots.txt allow/deny status for each of these user-agents. Each blocked crawler counts against the score.

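  • GPTBot (OpenAI): collects content for training OpenAI's models, including those behind ChatGPT. OpenAI documents separate agents (OAI-SearchBot, ChatGPT-User) for search and user-initiated fetches. Highest-impact bot to allow.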
  • ClaudeBot (Anthropic): the Anthropic training crawler, distinct from the Claude chat product. Used to build Claude's world model.
  • PerplexityBot (Perplexity): Perplexity blends live search with offline indexation; this bot fetches the offline half.
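  • Google-Extended: Google's AI training opt-out token. It is a robots.txt control rather than a separate crawler, and it governs whether content Google has crawled may be used for Gemini training and grounding. Per Google's documentation it does not affect inclusion or ranking in Google Search, so it is not the control for Search features such as AI Overviews.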
  • CCBot (Common Crawl): the open dataset that seeds most LLM training. Blocking CCBot effectively removes you from the open AI training corpus.
  • Bytespider (ByteDance / TikTok): trains TikTok's recommendation and AI models. Important if your audience overlaps with TikTok demographics.
  • Applebot (Apple): powers Apple Intelligence and Siri's web grounding. Apple also documents a separate Applebot-Extended token for opting out of model training only.
  • Amazonbot (Amazon): feeds Alexa's knowledge graph and Amazon's product-AI assistants like Rufus.
  • FacebookBot / Meta-ExternalAgent: Meta AI training (Llama, Meta AI assistant in Instagram/WhatsApp).
  • Cohere-AI: Cohere's enterprise LLM training crawler.
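
The per-crawler check described above can be approximated offline. The following is a minimal sketch using Python's standard urllib.robotparser; it is not AIVerdict's actual scoring logic. The helper name blocked_bots, the SAMPLE file, and the choice of Meta-ExternalAgent as the token for the Meta entry are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The ten user-agent tokens from the list above.
# "Meta-ExternalAgent" is assumed here as the token for the Meta entry.
AI_TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot",
    "Bytespider", "Applebot", "Amazonbot", "Meta-ExternalAgent", "Cohere-AI",
]


def blocked_bots(robots_txt: str, path: str = "/") -> list[str]:
    """Return the listed bots that this robots.txt text denies for `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_TRAINING_BOTS if not parser.can_fetch(bot, path)]


# Hypothetical robots.txt: GPTBot is fully blocked, CCBot only under /private/.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""

print(blocked_bots(SAMPLE))  # prints ['GPTBot']
```

Note that urllib.robotparser applies the first matching rule within a group, while some crawlers use longest-match precedence, so results can differ on files that mix Allow and Disallow rules for the same agent.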

Common Problems

Common issues

These are the most common reasons Training Access fails.

  • Default-deny robots.txt: Some hosting setups (Vercel previews, staging environments) ship a robots.txt with Disallow: / for all bots. If that file ever leaks to production, every AI crawler is blocked.
  • Cookie / consent walls: Crawlers can't accept cookies, so a consent wall that hides the page hides it from them too. Render meaningful content above the wall, or detect crawler user-agents and serve them the HTML without the wall.
  • Cloudflare bot-fight mode: "Super Bot Fight Mode" or aggressive WAF rules can block AI crawlers even when robots.txt allows them. Verify by fetching your homepage with curl -A "GPTBot" and checking the status code. This only exercises user-agent rules; a request from your own IP won't reveal IP-based blocks.
  • IP-based blocking: Some sites block whole AS numbers (cloud provider ranges) at the firewall level. AI crawlers often run from cloud IPs, so even a permissive robots.txt won't help if the network drops their packets.
  • No robots.txt at all: While "no robots.txt" technically allows everything, some crawlers treat its absence as a yellow flag. Always ship an explicit robots.txt with Allow: / rules for each AI bot you want to opt in.
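
Writing one group per crawler by hand is easy to get wrong, so the explicit Allow rules can be generated instead. This is a minimal sketch, assuming you keep the bots you want to opt in as a list of plain strings; the function name allow_blocks is hypothetical.

```python
def allow_blocks(bots: list[str]) -> str:
    """Return robots.txt text with one explicit 'Allow: /' group per bot."""
    groups = [f"User-agent: {bot}\nAllow: /" for bot in bots]
    # Groups are separated by a blank line; the file ends with a newline.
    return "\n\n".join(groups) + "\n"


# Example: emit groups for four of the crawlers covered by Training Access.
print(allow_blocks(["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]))
```

The output can be appended to an existing robots.txt; any Disallow rules you already rely on for other agents are left untouched because each group names its own user-agent.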

Taking Action

How to improve your Training Access score

A perfect Training Access score takes about 5 minutes to ship. Here's the order of operations.

  • Add explicit allow blocks for each AI training bot. In your robots.txt, add a section like:

    User-agent: GPTBot
    Allow: /

    User-agent: ClaudeBot
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Google-Extended
    Allow: /

    Repeat for each bot you want to opt in.
  • Render meaningful content above the cookie wall. If you must show a consent banner, make sure the page's H1, opening paragraph, and key facts are present in the initial HTML before the banner overlays. Crawlers stop at the wall.
  • Allowlist crawler IPs in your WAF. Cloudflare, Fastly, and AWS WAF all support known-bot allowlists. Enable the AI-bot allowlist or manually allow OpenAI, Anthropic, Perplexity, and Google IP ranges.
  • Verify with curl. After deploying, test with curl -I -A "GPTBot" https://yoursite.com/. Look for 200 OK. Anything else (403, 503, redirect to a captcha) means the bot is still blocked.
  • Re-run the free access check. Training Access updates as soon as the new robots.txt is live at the edge, typically within minutes of deploying. If your CDN (for example Cloudflare) still serves a cached copy, purge it or wait for the cache to expire, then re-audit.
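
The curl verification step can also be scripted. The sketch below uses only Python's standard library and mirrors the rule stated above: 200 is good, anything else needs attention. The names NoRedirect, interpret, and head_status are hypothetical, and the user-agent string is the bare token "GPTBot" as in the curl example. Like curl, this only exercises user-agent rules from your own IP, so it cannot reveal IP-based blocks.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Surface 3xx responses instead of silently following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError carrying the 3xx code


def interpret(status: int) -> str:
    """Map an HTTP status code to the outcome described in the step above."""
    if status == 200:
        return "reachable"
    if 300 <= status < 400:
        return "redirected (check for a challenge page)"
    return "blocked or failing"


def head_status(url: str, user_agent: str = "GPTBot", timeout: float = 10.0) -> int:
    """Send a HEAD request with the given user-agent and return the status code."""
    opener = urllib.request.build_opener(NoRedirect)
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx, and 5xx responses arrive here


# Usage (performs a network request, so it is left commented out):
# print(interpret(head_status("https://yoursite.com/")))
```

Redirects are not followed on purpose, because a redirect to a challenge page would otherwise end in a 200 and hide the block.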

Related

Why Training Access and Agent Access can disagree

It's normal to have a high Training Access score and a low Agent Access score (or vice versa) — they test very different things.

  • Different rules of engagement. Training crawlers generally respect robots.txt and other policy signals; live AI agents fetch on behalf of a real user and often ignore those. So a policy fix that opens you up to training crawlers may have no effect on Agent Access, and vice versa.
  • Different timescales of impact. Training Access changes percolate slowly: your fix today shows up in model behaviour weeks or months from now, after the next training refresh. Agent Access changes are felt immediately on the next real-time query against your site.
  • Different failure modes. Training Access fails on things training pipelines can't tolerate (broad bot-deny rules, cookie/consent walls, IP-level restrictions). Agent Access fails on things real-time fetches can't tolerate (slow responses, aggressive bot challenges, broken redirect chains, content gated behind modals).

FAQ

Frequently asked questions

  • Is Training Access really free? What's the catch?
    Yes, no catch. Training Access plus Agent Access make up the free access check — we run them on any URL with no signup. The 8 paid dimensions (the 3 deeper CORE dims and the 5 AURA dims) stay locked in the report until you spend a $29 audit credit. We give the bot-allowance check away because it's the single most actionable insight and it's also the most-asked question we get.
  • Should I block ClaudeBot but allow GPTBot?
    No, that's almost always counter-productive. The bots feed different AI products, and most users today consult multiple AI assistants. Blocking one bot just removes you from one product's training data — it doesn't protect any "exclusivity" because none of the AI products pay for content based on bot allowance. The exception is if your legal team has a specific compliance reason to block a vendor.
  • What's the difference between ClaudeBot and Claude (the chat product)?
    ClaudeBot is the Anthropic training crawler — it indexes the web to feed Claude's model. Claude (the chat product) is what users interact with. They're related but separate: blocking ClaudeBot affects what Claude knows long-term, but Claude can still browse your live site if a user asks it to (subject to its retrieval system, not robots.txt).
  • How quickly does Training Access reflect a robots.txt change?
    AIVerdict re-fetches robots.txt on every audit (no caching). So as soon as your new robots.txt is live at the edge (~5 minutes after deploy on most CDNs), the next free access check will reflect it. Note that the bots themselves may take days to re-crawl — AI Brand Recognition lags Training Access by weeks-to-months.

Run a free Training Access check

No signup. Get your Training Access + Agent Access score in 10–30 seconds, then unlock the full 10-dimension audit for $29 when you're ready.