The Complete AI Crawler Directory: Identification, Behavior, and Blocking Instructions

Server logs tell a story most publishers miss. Mixed among legitimate traffic and search engine bots sits a different category of crawler altogether. These bots don't index your content for search results. They ingest it for training datasets, retrieval pipelines, and inference systems powering the next generation of large language models.

The difference matters. Googlebot crawls to help users find your content. GPTBot crawls to help OpenAI build products that compete with your content. Search indexing and AI training serve fundamentally different economic functions.

Understanding which crawlers hit your domain, how they behave, and what options exist for blocking or monetizing them forms the foundation of any AI licensing strategy. You can't price what you can't identify. You can't negotiate from strength when you don't know who's at the table.

Why AI Crawlers Are Different From Search Crawlers

Training Crawls vs. Retrieval Crawls vs. Search Indexing

Not all web crawling serves the same purpose. The distinction shapes strategy.

Search indexing creates lookup tables. Googlebot reads your page, extracts signals, and stores enough information to return your URL when someone searches relevant queries. Your content helps users find you. The crawler and your interests align.

Training crawls build permanent knowledge. When ClaudeBot or GPTBot scrapes your archive, that content may enter model weights during the next training run. Your expertise becomes part of the AI's capability. Once the model is trained, no referral traffic comes back. The knowledge has already transferred.

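Retrieval crawls feed real-time lookups. Perplexity's bot fetches content at inference time to answer specific queries, and Google's AI Overviews draw on pages Googlebot has already crawled for Search. (Google-Extended is a robots.txt control token for Gemini training and grounding, not a retrieval crawler.) Your content gets summarized, cited (sometimes), and used to generate responses that often eliminate the need to visit your site.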

Training crawls happen less frequently but scrape deeper into archives. Retrieval crawls happen constantly but touch fewer pages. Pricing and blocking strategies should reflect this distinction.
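The three categories above can be sketched as a small lookup that tags a request by its user-agent string. The mapping table and the function name `classify_crawler` are assumptions made for illustration, built only from the crawlers this guide names; treat it as a starting point rather than a complete registry.

```python
# Illustrative mapping from user-agent token to crawl purpose.
# The table reflects the categories described in this guide and is an
# assumption of this sketch, not an authoritative registry.
CRAWLER_PURPOSE = {
    "gptbot": "training",
    "claudebot": "training",
    "ccbot": "training",
    "perplexitybot": "retrieval",
    "googlebot": "search",
    "bingbot": "search",
}


def classify_crawler(user_agent: str) -> str:
    """Return 'training', 'retrieval', 'search', or 'unknown'."""
    ua = user_agent.lower()
    for token, purpose in CRAWLER_PURPOSE.items():
        if token in ua:
            return purpose
    return "unknown"
```

Matching on a lowercase substring keeps the sketch tolerant of version suffixes and surrounding browser-style text in the full user-agent string.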

The Economics of 73,000 Scrapes Per Referral

Anthropic's ClaudeBot scraped one major publisher 73,000 times for every single referral it sent back. The number comes from server log analysis shared at a 2025 publishing industry conference. Other AI companies show similar ratios.

The math doesn't work like search. Google might crawl your page 10 times and send you 1,000 visitors. The crawling enables discovery. AI crawlers extract value without creating traffic. They take everything and return almost nothing.
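To make the comparison concrete, the arithmetic can be written as a crawl-to-referral ratio. The helper name `crawl_to_referral_ratio` is hypothetical; the inputs are the figures quoted above, with the search numbers being the illustrative ones from this paragraph rather than measured data.

```python
def crawl_to_referral_ratio(crawl_requests: int, referral_visits: int) -> float:
    """Crawler requests made per visitor sent back; higher means more extractive."""
    if referral_visits == 0:
        return float("inf")  # crawling with no referrals at all
    return crawl_requests / referral_visits


print(crawl_to_referral_ratio(73_000, 1))  # ClaudeBot figure cited above -> 73000.0
print(crawl_to_referral_ratio(10, 1_000))  # illustrative search example -> 0.01
```

Running the same calculation on your own logs, per crawler, gives a rough ranking of which bots take the most for the least return.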

This asymmetry explains why 75% of publishers now block CCBot, which feeds the Common Crawl dataset used to train most large language models. 69% block ClaudeBot. 62% block GPTBot. Publishers recognized the economics and responded.

Blocking stops extraction. It doesn't capture value. The shift toward Cloudflare Pay-Per-Crawl and RSL protocol licensing reflects publishers wanting compensation, not just protection.

Why Blocking AI Crawlers Doesn't Harm Traditional SEO

The fear is common: block AI crawlers, lose search rankings. The evidence doesn't support it.

GPTBot and Googlebot are separate crawlers with separate user-agent strings and separate robots.txt entries. Blocking one has no effect on the other. OpenAI has no influence over Google's search algorithm. Their systems don't communicate.

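Google-Extended is a robots.txt control token that governs whether content Google crawls may be used for Gemini training and grounding. Disallowing it does not affect traditional organic rankings, and Google's own documentation confirms it is not a ranking signal. It also does not remove your pages from AI Overviews: those are a Search feature, governed by Googlebot and snippet controls such as nosnippet.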

Case study data from 50+ publishers shows no measurable correlation between AI crawler blocking and organic traffic changes. Sites that blocked GPTBot, ClaudeBot, and Google-Extended maintained their search performance.

The Major AI Crawlers (Detailed Profiles)

GPTBot (OpenAI)

User-Agent: GPTBot/1.0 (+https://openai.com/gptbot)

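Primary Purpose: Training data collection. OpenAI documents separate agents for ChatGPT retrieval: ChatGPT-User for user-initiated fetches and OAI-SearchBot for search.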

robots.txt Compliance: Yes, generally respects directives

Documented IP Ranges (subject to change; check OpenAI's published list for current values):

20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28

Blocking via robots.txt:

User-agent: GPTBot
Disallow: /

OpenAI has demonstrated willingness to negotiate licensing arrangements and pay for content through Cloudflare Pay-Per-Crawl.
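User-agent strings are easy to spoof, so it can help to check whether a request claiming to be GPTBot actually comes from one of the ranges listed above. A minimal sketch using Python's standard `ipaddress` module follows; the function name `is_documented_gptbot_ip` is hypothetical, and the range list is copied from this profile and may be out of date.

```python
import ipaddress

# Ranges copied from the GPTBot profile above; they may have changed since.
GPTBOT_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "20.15.240.64/28",
        "20.15.240.80/28",
        "20.15.240.96/28",
        "20.15.240.176/28",
    )
]


def is_documented_gptbot_ip(remote_addr: str) -> bool:
    """True if remote_addr falls inside one of the listed GPTBot ranges."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # not a parseable IP address
    return any(addr in network for network in GPTBOT_RANGES)
```

A request with a GPTBot user-agent from an address outside these ranges is worth treating as an impersonator, or at least as a prompt to refresh the range list.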

ClaudeBot (Anthropic)

User-Agent: ClaudeBot/1.0 (+https://anthropic.com/claudebot)

Primary Purpose: Training data, retrieval for Claude

robots.txt Compliance: Yes, highly compliant

Documented IP Ranges:

160.79.104.0/23

Blocking via robots.txt:

User-agent: ClaudeBot
Disallow: /

Anthropic is among the most willing AI companies to pay via marketplace mechanisms. Multiple Cloudflare Pay-Per-Crawl publishers report ClaudeBot compliance with pricing terms without negotiation.

Google-Extended (Google Gemini)

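robots.txt Token: Google-Extended (a control token only; it has no separate HTTP user-agent string, and crawling is done by Google's existing user agents)

Primary Purpose: Controls whether content Google crawls may be used for Gemini training and grounding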

robots.txt Compliance: Yes

Google-Extended is distinct from Googlebot. This separation exists specifically so publishers can allow search indexing while blocking AI training usage.

robots.txt Configuration:

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

This configuration permits search indexing while preventing AI training usage.
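One way to sanity-check a configuration like this before deploying it is to run it through Python's standard `urllib.robotparser`. This is a rough sketch: the helper name `allowed` and the example URL are assumptions, and Python's parser uses simple substring matching on agent names, so it approximates rather than reproduces how Google evaluates the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
"""


def allowed(robots_txt: str, agent: str,
            url: str = "https://example.com/article") -> bool:
    """Report whether `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for agent in ("Googlebot", "Google-Extended", "bingbot"):
    print(agent, allowed(ROBOTS_TXT, agent))
```

Agents with no matching entry (bingbot here) are allowed by default, which is the expected robots.txt behavior when no wildcard rule is present.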

Bytespider (ByteDance/TikTok)

User-Agent: Bytespider

Primary Purpose: Training data for Chinese LLMs including Doubao

robots.txt Compliance: Inconsistent to non-existent

Bytespider represents the enforcement challenge publishers face. This crawler frequently ignores robots.txt directives entirely. Server log analysis shows 10,000-20,000+ daily requests from Bytespider even on sites with explicit blocking.

Enforcement Options:

robots.txt alone fails. Effective blocking requires server-level or CDN-level intervention:

Nginx:

if ($http_user_agent ~* "Bytespider") {
    return 403;
}

Cloudflare Firewall Rule:

(http.user_agent contains "Bytespider")

Action: Block

No payment or licensing pathway exists for Bytespider. The only option is blocking.
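Where neither Nginx nor Cloudflare sits in front of the application, the same user-agent check can be applied in application code. The sketch below is a plain WSGI middleware equivalent to the Nginx rule above; the class name `BlockCrawlerMiddleware` is hypothetical, and like any user-agent match it only catches crawlers that identify themselves honestly.

```python
import re

# Case-insensitive match, mirroring the `~*` operator in the Nginx rule.
BLOCKED_AGENTS = re.compile(r"bytespider", re.IGNORECASE)


class BlockCrawlerMiddleware:
    """WSGI middleware that answers 403 to matching user agents."""

    def __init__(self, app, pattern=BLOCKED_AGENTS):
        self.app = app
        self.pattern = pattern

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if self.pattern.search(user_agent):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)
```

Usage is a one-line wrap of an existing WSGI callable, for example `application = BlockCrawlerMiddleware(application)`. Blocking at the server or CDN layer remains cheaper, since the request never reaches the application.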

CCBot (Common Crawl)

User-Agent: CCBot/2.0 (https://commoncrawl.org/faq/)

Primary Purpose: Building open training datasets used by multiple AI companies

robots.txt Compliance: Yes

CCBot deserves special attention because its data feeds into nearly every major LLM. OpenAI, Anthropic, Meta, and others have all used Common Crawl datasets for training. Blocking CCBot indirectly reduces your content's presence in multiple AI systems simultaneously.

Current data shows 75% of surveyed publishers block CCBot.

Blocking Strategies: robots.txt vs. Server-Level vs. Cloudflare

Comprehensive robots.txt Block List

# AI Training and Retrieval Crawlers - Blocked

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: PerplexityBot
Disallow: /

# Search Indexing - Allowed

User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /
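Because this list needs updating as new crawlers appear, it may be easier to generate the file from a pair of lists than to edit it by hand. The following is a minimal sketch; the function name `build_robots_txt` and the comment text it emits are assumptions, and the agent names are the ones used in the block list above.

```python
def build_robots_txt(blocked, allowed):
    """Render a robots.txt that disallows `blocked` agents and allows `allowed` ones."""
    lines = ["# AI crawlers - blocked", ""]
    for agent in blocked:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["# Search indexing - allowed", ""]
    for agent in allowed:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    return "\n".join(lines).rstrip() + "\n"


robots_txt = build_robots_txt(
    blocked=["GPTBot", "ClaudeBot", "Google-Extended", "Bytespider", "CCBot",
             "Applebot-Extended", "Meta-ExternalAgent", "PerplexityBot"],
    allowed=["Googlebot", "bingbot"],
)
print(robots_txt)
```

Keeping the two lists in version control gives the quarterly audit a single place to record additions.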

Cloudflare Firewall Rules

Cloudflare provides the most flexible approach, combining detection, blocking, and monetization options.

Blocking rule:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "CCBot")

Action: Block

Cloudflare Pay-Per-Crawl integrates with these rules. Rather than blocking compliant crawlers like GPTBot and ClaudeBot, you can require payment for access.
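The blocking expression shown above follows a regular pattern, so it can be assembled from the same token list used elsewhere. This sketch only builds the expression text in the form shown in this guide; the function name `build_block_expression` is hypothetical, and pasting the result into a Cloudflare rule remains a manual step.

```python
def build_block_expression(tokens):
    """Join user-agent tokens into an expression of the form shown above."""
    for token in tokens:
        if '"' in token:
            # A double quote would break out of the string literal.
            raise ValueError(f"token must not contain a double quote: {token!r}")
    clauses = [f'(http.user_agent contains "{token}")' for token in tokens]
    return " or\n".join(clauses)


print(build_block_expression(["GPTBot", "ClaudeBot", "Bytespider", "CCBot"]))
```

Dropping GPTBot and ClaudeBot from the list leaves a rule that blocks only the non-paying crawlers, which fits the pay-for-access approach described above.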

Monitoring Crawler Activity Over Time

Quarterly Audit Checklist

Every three months, review:

New user-agents appearing in logs at significant frequency
IP ranges associated with high-volume unknown traffic
Behavioral patterns suggesting undocumented AI crawlers
Industry reports on newly launched AI company crawlers
Pricing effectiveness of current monetization approach
Blocking effectiveness against non-compliant crawlers

The AI crawler landscape evolves faster than documentation. New crawlers from Chinese AI companies, European startups, and academic institutions appear constantly. Quarterly audits catch new entrants before they extract significant value.
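The first checklist item, spotting new user-agents at significant frequency, can be roughly automated against access logs. The sketch below assumes the common "combined" log format, where the user agent is the last quoted field on each line. The known-token list, the bot-hint heuristic, and the function name `unknown_bot_agents` are all assumptions to adapt to your own logs.

```python
import re
from collections import Counter

# Crawlers already covered in this guide; anything else bot-like gets flagged.
KNOWN_TOKENS = ("gptbot", "claudebot", "bytespider", "ccbot", "applebot",
                "meta-externalagent", "perplexitybot", "googlebot", "bingbot")
# Heuristic hints that a user agent is an automated crawler (an assumption).
BOT_HINTS = ("bot", "spider", "crawl")
# In combined log format the user agent is the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def unknown_bot_agents(log_lines, min_hits=100):
    """Return (user_agent, hits) pairs for unrecognized bot-like agents."""
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        user_agent = match.group(1)
        ua = user_agent.lower()
        if any(token in ua for token in KNOWN_TOKENS):
            continue  # already documented
        if any(hint in ua for hint in BOT_HINTS):
            counts[user_agent] += 1
    return [(ua, hits) for ua, hits in counts.most_common() if hits >= min_hits]
```

Crawlers that disguise themselves as browsers will not show up here; those need the IP-range and behavioral checks from the rest of the checklist.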

This guide is part of the AI Pay Per Crawl implementation series. For related content, see Cloudflare Pay-Per-Crawl Setup and RSL Protocol Implementation.