title:: AI Crawler User Agent Strings: Complete Reference Table
description:: Complete reference table of every known AI crawler user agent string. Includes GPTBot, ClaudeBot, Bytespider, Google-Extended, and 30+ other AI bot identifiers.
focus_keyword:: ai crawler user agent strings list
category:: crawlers
author:: Victor Valentine Romo
date:: 2026.03.20
AI Crawler User Agent Strings: Complete Reference Table
Quick Summary
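- What this covers: user-agent strings for documented AI crawlers, plus server-side detection and verification patterns
- Who it's for: publishers and site owners managing AI bot traffic
- Key takeaway: User-agent matching identifies most AI crawlers, but the strings are self-reported, so pair them with IP or reverse DNS verification before relying on them.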
Every AI crawler identifies itself through a user-agent string in HTTP request headers. These strings are the first line of identification — the label that tells your server who's knocking. Without an accurate, current reference of AI crawler user-agent strings, you're blind to which AI companies are extracting your content and how much they're taking.
This reference table covers every documented AI crawler user-agent string as of early 2026. It serves as the foundation for server-level blocking, analytics dashboards, and monetization configurations. Bookmark it. Your robots.txt and Nginx configurations depend on it.
Major AI Training Crawlers
These crawlers collect content for permanent incorporation into AI model training datasets.
OpenAI Crawlers
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| GPTBot/1.0 (+https://openai.com/gptbot) | Model training and pre-indexing | High | Yes (Pay-Per-Crawl) |
| ChatGPT-User | Real-time browsing for ChatGPT | High | Limited |
| OAI-SearchBot/1.0 | SearchGPT web retrieval | High | Emerging |
GPTBot is the primary training crawler. ChatGPT-User is the real-time retrieval agent. OAI-SearchBot serves OpenAI's search product. Each requires separate robots.txt directives for independent control.
Full profile: GPTBot Crawler Profile
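One way to sanity-check that per-agent directives really behave independently is to parse a draft robots.txt with Python's standard `urllib.robotparser` before deploying it. The sketch below is illustrative only: the robots.txt content and the example.com URL are assumptions, and the parser shows what the file permits, not whether a given crawler will honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawler, allow the retrieval agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each agent token, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["GPTBot", "ChatGPT-User", "OAI-SearchBot"],
    "https://example.com/articles/some-post",  # placeholder URL
)
print(result)
```

OAI-SearchBot has no group of its own in this draft and there is no `User-agent: *` group, so the parser treats it as allowed. That is exactly the kind of gap a check like this is meant to surface.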
Anthropic Crawlers
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| ClaudeBot/1.0 (+https://anthropic.com/claudebot) | Model training data | Very high | Yes (Pay-Per-Crawl) |
| ClaudeBot-User/1.0 (+https://anthropic.com/claudebot) | Real-time retrieval | Very high | Limited |
ClaudeBot demonstrates the highest compliance rate among major AI crawlers. Separate directives for ClaudeBot and ClaudeBot-User enable independent control of training vs. retrieval.
Full profile: ClaudeBot Crawler Profile
Google Crawlers
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| Googlebot (standard) | Search indexing | Very high | N/A (search) |
| Google-Extended | AI training (Gemini) | High | Licensing deals |
Google-Extended is a permission token, not a separate crawler. It shares infrastructure with Googlebot. Blocking Google-Extended prevents AI training use without affecting search indexing.
Full profile: Google-Extended Crawler Profile
ByteDance Crawlers
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| Bytespider | AI training (Doubao, TikTok) | Very low | No |
| Mozilla/5.0 (compatible; Bytespider; spider-feedback@bytedance.com) | Same (extended format) | Very low | No |
Bytespider frequently spoofs its user-agent string, appearing as standard browsers. User-agent detection alone is insufficient — combine with IP/ASN blocking.
Full profile: Bytespider Crawler Profile
Meta Crawlers
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| Meta-ExternalAgent/1.0 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) | AI training (LLaMA) | Moderate-high | Limited |
| facebookexternalhit/1.1 | Social sharing previews | N/A | N/A (keep allowed) |
| Facebot | Social features | N/A | N/A (keep allowed) |
Block Meta-ExternalAgent for AI training opt-out. Do NOT block facebookexternalhit or Facebot — these handle link previews on Facebook and Instagram.
Full profile: Meta AI Crawler Profile
Amazon Crawlers
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| Amazonbot/0.1 (https://developer.amazon.com/support/amazonbot) | AI training (Alexa, Rufus, Q) | High | Limited |
Full profile: Amazonbot Crawler Profile
Apple Crawlers
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| Applebot-Extended | AI training (Apple Intelligence) | High | Limited |
| Applebot/0.1 (+http://www.apple.com/go/applebot) | Siri/Spotlight (keep allowed) | Very high | N/A |
Block Applebot-Extended for AI training opt-out. Keep Applebot allowed for Siri knowledge features.
Full profile: Applebot-Extended Crawler Profile
Common Crawl
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| CCBot/2.0 (https://commoncrawl.org/faq/) | Open training datasets | High | No (nonprofit) |
Blocking CCBot denies training data to dozens of AI companies simultaneously. The highest-leverage single block.
Full profile: CCBot Profile
AI Search and Retrieval Crawlers
These crawlers fetch content for real-time query answering rather than permanent model training.
Perplexity
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| PerplexityBot | AI search retrieval | Disputed | Limited |
PerplexityBot has faced scraping controversies over robots.txt compliance and content attribution.
Cohere
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| cohere-ai | Enterprise AI retrieval | Moderate | Limited |
Cohere operates primarily in enterprise RAG deployments.
You.com
| User-Agent String | Purpose | Compliance | Monetizable |
|---|---|---|---|
| YouBot | AI search | Moderate | No |
Emerging and Specialized AI Crawlers
Mistral
| User-Agent String | Purpose | Compliance |
|---|---|---|
| MistralBot | Model training | Moderate |
Allen Institute for AI (Ai2)
| User-Agent String | Purpose | Compliance |
|---|---|---|
| AI2Bot | Research and training | High |
Hugging Face
| User-Agent String | Purpose | Compliance |
|---|---|---|
| HuggingFaceBot | Model training datasets | Moderate |
DeepSeek
| User-Agent String | Purpose | Compliance |
|---|---|---|
| Deepseekbot | Model training | Low-Moderate |
Baidu (ERNIE)
| User-Agent String | Purpose | Compliance |
|---|---|---|
| Baiduspider | Search + AI training | Moderate |
Note: Baiduspider serves both traditional search and AI training for Baidu's ERNIE models. Blocking it may reduce your visibility in Baidu search, which matters if you serve a Chinese-speaking audience.
Others
| User-Agent String | Operator | Purpose |
|---|---|---|
| Diffbot | Diffbot | Knowledge graph construction |
| Webzio-Extended | Webz.io | Data feeds for AI companies |
| Scrapy | Various | Generic scraping framework (not specific to one company) |
| DataForSeoBot | DataForSEO | SEO data + AI features |
| SemrushBot | Semrush | SEO data + AI features |
| AhrefsBot | Ahrefs | SEO data + AI features |
| PetalBot | Huawei | Search + AI (Petal Search) |
| ImagesiftBot | Imagesift | Image training data |
Detection Patterns for Server Configuration
Nginx Map for All Known AI Crawlers
map $http_user_agent $is_ai_crawler {
    default 0;
    ~*GPTBot 1;
    ~*ChatGPT-User 1;
    ~*OAI-SearchBot 1;
    ~*ClaudeBot 1;
    ~*Bytespider 1;
    ~*bytedance 1;
    ~*Google-Extended 1;
    ~*Meta-ExternalAgent 1;
    ~*Amazonbot 1;
    ~*Applebot-Extended 1;
    ~*CCBot 1;
    ~*PerplexityBot 1;
    ~*cohere-ai 1;
    ~*YouBot 1;
    ~*MistralBot 1;
    ~*AI2Bot 1;
    ~*Deepseekbot 1;
    ~*Diffbot 1;
    ~*PetalBot 1;
    ~*ImagesiftBot 1;
}
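Use $is_ai_crawler in conditional blocks for blanket AI crawler management. Because Google-Extended is a robots.txt permission token rather than a string sent in request headers, its pattern will not match live traffic; control it through robots.txt instead.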
Separate Training vs. Search Classification
# Training crawlers
map $http_user_agent $is_ai_training_crawler {
    default 0;
    ~*GPTBot 1;
    ~*ClaudeBot/1 1;
    ~*Google-Extended 1;
    ~*Meta-ExternalAgent 1;
    ~*Bytespider 1;
    ~*bytedance 1;
    ~*CCBot 1;
    ~*Amazonbot 1;
    ~*Applebot-Extended 1;
    ~*MistralBot 1;
    ~*Deepseekbot 1;
}

# Search/retrieval crawlers
map $http_user_agent $is_ai_search_crawler {
    default 0;
    ~*ChatGPT-User 1;
    ~*ClaudeBot-User 1;
    ~*PerplexityBot 1;
    ~*cohere-ai 1;
    ~*YouBot 1;
    ~*OAI-SearchBot 1;
}
This enables the dual strategy: block training, allow search.
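The same training-versus-search split is useful outside the web server, for example when labeling rows in a log analysis. The following is a minimal Python sketch that reuses the pattern lists from the two maps; the `classify` function and its return labels are illustrative names, not part of any tool mentioned here.

```python
import re

# Patterns copied from the two Nginx maps; matching is case-insensitive like ~*.
TRAINING_PATTERNS = [
    "GPTBot", "ClaudeBot/1", "Google-Extended", "Meta-ExternalAgent",
    "Bytespider", "bytedance", "CCBot", "Amazonbot", "Applebot-Extended",
    "MistralBot", "Deepseekbot",
]
SEARCH_PATTERNS = [
    "ChatGPT-User", "ClaudeBot-User", "PerplexityBot", "cohere-ai",
    "YouBot", "OAI-SearchBot",
]


def _compile(patterns: list[str]) -> re.Pattern:
    """Build one case-insensitive alternation, escaping each literal token."""
    return re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE)


TRAINING_RE = _compile(TRAINING_PATTERNS)
SEARCH_RE = _compile(SEARCH_PATTERNS)


def classify(user_agent: str) -> str:
    """Return 'search', 'training', or 'other' for a raw user-agent string."""
    if SEARCH_RE.search(user_agent):
        return "search"
    if TRAINING_RE.search(user_agent):
        return "training"
    return "other"


for ua in (
    "GPTBot/1.0 (+https://openai.com/gptbot)",
    "ClaudeBot-User/1.0 (+https://anthropic.com/claudebot)",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/124.0",
):
    print(classify(ua), "<-", ua)
```

Search patterns are checked first so that a retrieval agent is never mislabeled as a training crawler if a future token happens to overlap.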
Apache .htaccess Pattern
# Block all known AI training crawlers (requires mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot/1|Google-Extended|Meta-ExternalAgent|Bytespider|bytedance|CCBot|Amazonbot|Applebot-Extended|MistralBot|Deepseekbot) [NC]
RewriteRule .* - [F,L]
Keeping This List Current
Why User-Agent Lists Decay
AI crawler user-agent strings change. New AI companies emerge. Existing companies deploy new crawlers. User-agent formats evolve. A list accurate in January 2026 will have gaps by June 2026.
Sources for updates:
- Cloudflare Radar — Publishes bot traffic data including new user agents
- Dark Visitors (darkvisitors.com) — Community-maintained AI crawler database
- Server log analysis — Your own logs reveal crawlers not yet publicly documented
- AI company documentation — Official crawler pages from OpenAI, Anthropic, Google, etc.
- Publisher forums and trade publications — Early reports of new crawler activity
Monthly Audit Process
- Review access logs for unrecognized user agents with high request volumes
- Cross-reference new agents against known AI company IP ranges
- Check behavioral patterns (systematic crawling vs. legitimate browser behavior)
- Update Nginx maps, robots.txt, and CDN rules with new entries
- Remove deprecated entries (crawlers that no longer operate)
The AI crawler audit walkthrough provides the complete step-by-step process.
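The first audit step, finding unrecognized user agents with high request volumes, can be roughed out in a few lines of Python. This sketch assumes access logs in the combined log format (user agent as the last quoted field); the `KNOWN_TOKENS` subset, the threshold, and the `NewAIBot` sample agent are all hypothetical. It also reports ordinary browsers above the threshold, so treat the output as a shortlist for manual review rather than a verdict.

```python
import re
from collections import Counter

# Subset of tokens from the reference tables; extend with the full list in practice.
KNOWN_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "CCBot", "Bytespider", "PerplexityBot",
]

# In the combined log format the user agent is the last quoted field on the line.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def unrecognized_agents(log_lines, known_tokens=KNOWN_TOKENS, min_requests=100):
    """Return (user_agent, count) pairs not matching any known token, busiest first."""
    counts = Counter()
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    known = [token.lower() for token in known_tokens]
    return [
        (agent, n)
        for agent, n in counts.most_common()
        if n >= min_requests and not any(t in agent.lower() for t in known)
    ]


# Synthetic sample lines; "NewAIBot" is a made-up agent for illustration.
prefix = '198.51.100.7 - - [20/Mar/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" '
sample = (
    [prefix + '"NewAIBot/0.1"'] * 3
    + [prefix + '"GPTBot/1.0 (+https://openai.com/gptbot)"'] * 2
    + [prefix + '"Mozilla/5.0 (X11; Linux x86_64)"']
)
print(unrecognized_agents(sample, min_requests=2))  # [('NewAIBot/0.1', 3)]
```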
Spoofing and Verification
The Trust Problem
User-agent strings are self-reported. Any HTTP client can claim any identity. A scraper can set its user-agent to Googlebot and your server would see a request apparently from Google. This makes user-agent matching necessary but insufficient.
Verification Methods by Crawler
| Crawler | Verification Method |
|---|---|
| GPTBot | IP range check (20.15.240.x ranges) |
| ClaudeBot | IP range check (160.79.104.0/23) |
| Googlebot | Reverse DNS (*.googlebot.com) |
| Applebot | Reverse DNS (*.applebot.apple.com) |
| Bingbot | Reverse DNS (*.search.msn.com) |
| Bytespider | ASN check (AS396986, AS138294) |
| Most others | No official verification method |
For crawlers without official IP ranges, behavioral analysis provides secondary verification. Legitimate AI crawlers exhibit consistent patterns — systematic access, regular intervals, coverage of content sections. Spoofed crawlers often show erratic patterns — random pages, irregular timing, combined with other suspicious activity.
The IP verification guide covers verification methods in detail.
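Both verification styles in the table can be sketched with Python's standard library. The CIDR block and hostname suffix below are copied from the table purely as examples; confirm current values against each operator's documentation before use. The resolver parameters exist so the function can be exercised without network access, and the stub resolvers in the demo are hypothetical. Note that `socket.gethostbyname_ex` resolves IPv4 only.

```python
import ipaddress
import socket


def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def forward_confirmed_rdns(
    ip: str,
    allowed_suffixes: tuple[str, ...],
    reverse=socket.gethostbyaddr,
    forward=socket.gethostbyname_ex,
) -> bool:
    """Reverse-resolve ip, check the hostname suffix, then confirm it resolves back."""
    try:
        hostname = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.lower().rstrip(".").endswith(allowed_suffixes):
        return False
    try:
        forward_ips = forward(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips


# Range check using the ClaudeBot block listed in the table.
print(ip_in_ranges("160.79.104.10", ["160.79.104.0/23"]))

# Offline demo with stub resolvers standing in for real DNS lookups.
def stub_reverse(ip):
    return ("crawl-66-249-66-1.googlebot.com", [], [ip])

def stub_forward(host):
    return (host, [], ["66.249.66.1"])

print(forward_confirmed_rdns("66.249.66.1", (".googlebot.com",), stub_reverse, stub_forward))
```

Keeping the leading dot in each suffix matters: it prevents a hostname such as `fakegooglebot.com` from passing the check.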
When Blocking AI Crawlers Isn't the Move
Skip this if:
- Your site has fewer than 1,000 monthly organic visits. AI crawlers aren't your problem; getting indexed by traditional search is. Focus on content quality and link acquisition before worrying about bot management.
- You're running a personal blog or portfolio site. AI citation of your content is free exposure at this scale. Blocking crawlers costs you visibility without protecting meaningful revenue.
- Your revenue comes entirely from direct sales, not content. If your content isn't the product (e-commerce, SaaS with no content moat), AI crawlers are neutral. Your competitive advantage lives in the product, not the pages.
Frequently Asked Questions
How often do AI crawler user-agent strings change?
Rarely for major crawlers. GPTBot, ClaudeBot, and CCBot have maintained stable user-agent strings since launch. New versions or format changes are typically documented by the operating company. The bigger risk is entirely new crawlers appearing from new AI companies — these require log monitoring to detect.
Should I block all AI crawlers listed here?
Not necessarily. The publisher decision framework helps determine which crawlers to block, which to monetize, and which to allow. Blanket blocking forfeits all AI licensing revenue. Selective blocking and monetization maximize both protection and revenue.
What if a crawler doesn't identify itself with any known user-agent?
Unidentified crawlers require behavioral detection. High request volumes, systematic access patterns, absence of CSS/JS/image requests, and requests from data center IP ranges (rather than residential or mobile) suggest bot activity. Server log analysis and CDN bot management tools help identify these unlabeled crawlers.
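Two of those signals, request volume and the absence of CSS/JS/image requests, are easy to compute per client from a list of requested paths. The sketch below is a rough heuristic under stated assumptions: the extension list and both thresholds are arbitrary placeholders to tune against your own traffic, and it ignores IP origin and timing entirely.

```python
from dataclasses import dataclass

# Assumed set of static-asset extensions; adjust for your site.
ASSET_EXTENSIONS = (
    ".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".woff2",
)


@dataclass
class ClientStats:
    requests: int
    asset_requests: int


def summarize(paths: list[str]) -> ClientStats:
    """Count total requests and how many targeted static assets."""
    assets = sum(
        1 for path in paths
        if path.split("?", 1)[0].lower().endswith(ASSET_EXTENSIONS)
    )
    return ClientStats(requests=len(paths), asset_requests=assets)


def looks_like_bot(stats: ClientStats, min_requests: int = 200,
                   max_asset_ratio: float = 0.02) -> bool:
    """Flag high-volume clients that fetch almost no CSS, JS, or images."""
    if stats.requests < min_requests:
        return False
    return stats.asset_requests / stats.requests <= max_asset_ratio


crawler_paths = [f"/post/{i}" for i in range(300)]
browser_paths = [f"/post/{i}" for i in range(100)] + ["/static/site.css?v=3"] * 200
print(looks_like_bot(summarize(crawler_paths)))
print(looks_like_bot(summarize(browser_paths)))
```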
Can I use this list for Cloudflare firewall rules?
Yes. Create a Cloudflare WAF custom rule matching user-agent strings from this table. The rule can block, challenge, or log matching requests. For Pay-Per-Crawl publishers, Cloudflare's built-in AI crawler detection handles identification automatically — but manual rules provide backup coverage for crawlers not yet in Cloudflare's database.
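If the token list lives in code or a spreadsheet, the rule expression can be generated rather than typed by hand. This Python sketch builds an expression from the `http.user_agent` field and the `contains` operator of Cloudflare's rules language; the helper name and the short token list are illustrative. Unlike the `~*` patterns in the Nginx maps, `contains` is case-sensitive, so tokens should be written exactly as the crawlers send them.

```python
def cloudflare_ua_expression(tokens: list[str]) -> str:
    """Join user-agent tokens into a single 'contains ... or contains ...' expression."""
    clauses = []
    for token in tokens:
        # Refuse characters that would need escaping inside a quoted string literal.
        if '"' in token or "\\" in token:
            raise ValueError(f"unsupported character in token: {token!r}")
        clauses.append(f'(http.user_agent contains "{token}")')
    return " or ".join(clauses)


expression = cloudflare_ua_expression(["GPTBot", "CCBot", "Bytespider"])
print(expression)
```

The resulting string can be pasted into the expression editor of a custom rule and paired with whichever action (block, challenge, or log) suits the crawler.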
Where can I find real-time updates to this list?
Monitor darkvisitors.com for community-maintained updates, Cloudflare Radar for bot traffic trends, and individual AI company documentation pages for official changes. Your own server logs are the most authoritative source — they show exactly which user agents are hitting your specific domain.