How Often Do AI Crawlers Hit Your Site? Crawl Frequency Benchmarks

Quick Summary

What this covers: AI crawler frequency benchmarks across industries. Request rates, scraping intervals, and volume patterns for GPTBot, ClaudeBot, PerplexityBot, and other training bots.

Who it's for: Publishers and site owners managing AI bot traffic.

Key takeaway: Read the first section for the core framework, then use the specific tactics that match your situation.

AI crawlers don't scrape once and disappear. They return. GPTBot requests your article today, revisits next week, scrapes again next month. PerplexityBot hits your homepage hourly. ClaudeBot crawls your sitemap every three days.

Frequency patterns reveal intent. Training bots (GPTBot, CCBot) scrape infrequently—once per content update cycle, focused on ingesting corpus for model training. Answer engines (PerplexityBot) scrape continuously—real-time content retrieval for user queries.

Understanding frequency benchmarks answers critical questions: How much server load should I expect? Is this scraping volume normal or excessive? Do crawlers respect reasonable limits? Should licensing deals include request quotas?

Publishers without frequency data negotiate blind. You agree to "reasonable access" without defining what reasonable means. The AI company interprets this liberally and scrapes 50,000 times a month when the industry norm is 5,000. The contract has no enforcement mechanism because you never established a baseline.

This guide provides empirical frequency benchmarks across bot types and industries, quantifies normal vs. excessive scraping patterns, and shows how to use frequency data in licensing negotiations.

Measuring Crawl Frequency

Request Rate vs. Visit Frequency

Distinction matters:

Request rate: Requests per second/minute/hour (technical capacity measurement).

Visit frequency: How often crawler returns to site (days/weeks between visits).

Example:

GPTBot visit pattern:

Jan 1: 500 requests (scrapes site)
Jan 8: 450 requests (revisits)
Jan 15: 480 requests (revisits)
Jan 22: 520 requests (revisits)

Visit frequency: Weekly (every 7 days)

Request rate during visit: 500 requests over 2 hours = 4.2 requests/minute

Both metrics matter:

Visit frequency determines content freshness for AI (weekly updates mean AI sees changes within 7 days)
Request rate determines server load (4 req/min is manageable, 400 req/min strains infrastructure)
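The two metrics can be computed separately from a list of visit dates and a per-visit request count. A minimal sketch using the GPTBot example above; the function names are illustrative and the year is an assumption, since the example gives only month and day:

```python
from datetime import date

def mean_visit_interval_days(visit_dates):
    """Average number of days between consecutive crawler visits."""
    ordered = sorted(visit_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)

def requests_per_minute(request_count, duration_hours):
    """Average request rate over the duration of a single visit."""
    return request_count / (duration_hours * 60)

# Visit dates from the example (year assumed)
visits = [date(2026, 1, 1), date(2026, 1, 8), date(2026, 1, 15), date(2026, 1, 22)]

print(mean_visit_interval_days(visits))       # 7.0 days between visits
print(round(requests_per_minute(500, 2), 1))  # 4.2 requests/minute
```

Keeping the two calculations apart makes it harder to confuse a crawler that visits rarely but hits hard with one that visits often but gently.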

Calculating Average Requests Per Day

Extract from logs:

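# Count AI crawler requests per day across the current and rotated logs
# (-h drops the filename prefix grep adds when given several files;
#  use zgrep instead if rotated logs are gzip-compressed)
for bot in "GPTBot" "ClaudeBot" "PerplexityBot"; do
    echo "=== $bot ==="
    grep -h "$bot" /var/log/nginx/access.log* | \
    awk '{print $4}' | \
    cut -d: -f1 | \
    sort | uniq -c
done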

Output:

=== GPTBot ===
145 [01/Jan/2026
152 [02/Jan/2026
148 [05/Jan/2026
...

Days with no matching requests (Jan 3 and 4 here) do not appear, because uniq -c only counts lines that exist.

Calculate average (per day on which the bot made at least one request):

grep -h "GPTBot" /var/log/nginx/access.log* | \
awk '{print $4}' | cut -d: -f1 | sort | uniq -c | \
awk '{sum+=$1; days++} END {if (days) print "Avg requests/day:", sum/days}'

Result: Avg requests/day: 142.3

Interpret:

<10 req/day: Minimal crawling (spot-checking, low-priority content)
10-100 req/day: Light crawling (periodic checks, selective content)
100-1,000 req/day: Moderate crawling (regular site coverage)
1,000-10,000 req/day: Heavy crawling (comprehensive indexing)
>10,000 req/day: Intensive crawling (real-time monitoring or training data collection)
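The log pipeline only lists days on which the bot actually made requests, so its average is per active day. A rough sketch of averaging over every calendar day instead and mapping the result onto the tiers above. The function names are illustrative, and treating each tier's lower bound as inclusive is an assumption, since the listed ranges share their endpoints:

```python
def average_per_calendar_day(daily_counts, days_in_window):
    """Average over every day in the window; days missing from the dict count as zero."""
    return sum(daily_counts.values()) / days_in_window

def classify_daily_volume(avg_requests_per_day):
    """Map an average daily request count onto the tiers listed above."""
    if avg_requests_per_day < 10:
        return "minimal"
    if avg_requests_per_day < 100:
        return "light"
    if avg_requests_per_day < 1_000:
        return "moderate"
    if avg_requests_per_day <= 10_000:
        return "heavy"
    return "intensive"

# Three active days inside a five-day window (Jan 1-5)
counts = {"01/Jan/2026": 145, "02/Jan/2026": 152, "05/Jan/2026": 148}
avg = average_per_calendar_day(counts, days_in_window=5)

print(avg, classify_daily_volume(avg))  # 89.0 light
```

The same three days average about 148 requests per active day, which lands in the moderate tier. Which denominator to use depends on whether the question is server load during a visit or total monthly volume.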

Peak vs. Off-Peak Patterns

Do crawlers scrape evenly or in bursts?

Hourly distribution analysis:

grep "GPTBot" /var/log/nginx/access.log | \
awk -F: '{print $2}' | sort | uniq -c | sort -rn

Output:

4523 03  (3am)
4234 04  (4am)
3987 02  (2am)
3821 05  (5am)
1245 15  (3pm)
892  14  (2pm)

Pattern: GPTBot scrapes primarily 2am-5am (off-peak hours).

Implication: Polite crawler minimizing impact on human traffic.

Compare to PerplexityBot:

2341 09  (9am)
2287 10  (10am)
2198 14  (2pm)
2134 11  (11am)

Pattern: Evenly distributed across daytime hours.

Implication: Real-time answer engine, scrapes in response to user queries (peaks during waking hours).

Polite vs. aggressive:

Polite: Off-peak scraping, gradual ramp-up, respects server signals

Aggressive: Peak-hour scraping, ignores 429 rate limits, hammers origin servers
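One way to put a number on "off-peak scraping" is the share of a bot's requests that fall inside an off-peak window. A sketch assuming nginx combined-format logs and a 10pm-6am window; the sample lines and user-agent strings are made up for illustration, and the hours are whatever timezone the server logs in:

```python
import re
from collections import Counter

# Captures the hour from a timestamp such as [05/Jan/2026:03:12:45 +0000]
TIMESTAMP_HOUR = re.compile(r"\[\d{2}/[A-Za-z]{3}/\d{4}:(\d{2}):\d{2}:\d{2}")

def hourly_counts(log_lines, bot_name):
    """Count requests per hour of day for lines mentioning bot_name."""
    counts = Counter()
    for line in log_lines:
        if bot_name not in line:
            continue
        match = TIMESTAMP_HOUR.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

def off_peak_share(counts, start_hour=22, end_hour=6):
    """Fraction of requests in a window that wraps past midnight (start_hour > end_hour)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    off_peak = sum(n for hour, n in counts.items() if hour >= start_hour or hour < end_hour)
    return off_peak / total

sample = [
    '203.0.113.5 - - [05/Jan/2026:03:12:45 +0000] "GET /a HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [05/Jan/2026:04:01:10 +0000] "GET /b HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '203.0.113.5 - - [05/Jan/2026:15:30:00 +0000] "GET /c HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '198.51.100.7 - - [05/Jan/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "OtherBot/1.0"',
]

gptbot_hours = hourly_counts(sample, "GPTBot")
print(round(off_peak_share(gptbot_hours), 2))  # 0.67
```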

Benchmark Data by Bot Type

Training Bots (GPTBot, CCBot, Anthropic-AI)

Purpose: Collect data for model training.

Frequency pattern: Episodic (weeks/months between major scrapes).

GPTBot benchmarks (2025-2026 data):

Small publisher (100K monthly visitors):

Visit frequency: Every 14-21 days
Requests per visit: 150-300
Total monthly: 600-1,200 requests
Peak rate: 2-3 requests/minute

Medium publisher (1M monthly visitors):

Visit frequency: Every 7-10 days
Requests per visit: 1,500-3,000
Total monthly: 6,000-12,000 requests
Peak rate: 8-12 requests/minute

Large publisher (10M+ monthly visitors):

Visit frequency: Every 3-5 days
Requests per visit: 15,000-30,000
Total monthly: 90,000-180,000 requests
Peak rate: 50-100 requests/minute

CCBot (Common Crawl):

Operates quarterly: Major scraping cycles every 3 months.

Monthly average low, but specific months see massive spikes.

Example timeline:

January: 500 requests (minimal activity)
February: 800 requests (minimal)
March: 45,000 requests (quarterly crawl)
April: 400 requests (minimal)

Average: 11,675 req/month, but clustered in specific weeks.

Anthropic ClaudeBot:

Similar to GPTBot but slightly more frequent.

Visit frequency: Every 5-7 days
More consistent volume (less bursty than GPTBot)

Benchmarks: 20-30% higher request volume than GPTBot for equivalent publisher size.

Answer Engine Bots (PerplexityBot, YouBot)

Purpose: Real-time content retrieval for user search queries.

Frequency pattern: Continuous (daily scraping, often hourly checks).

PerplexityBot benchmarks:

Small publisher:

Visit frequency: Daily (often multiple times per day)
Requests per day: 50-200
Total monthly: 1,500-6,000
Pattern: Distributed throughout day (peaks 9am-5pm)

Medium publisher:

Visit frequency: Hourly checks on popular articles
Requests per day: 800-2,000
Total monthly: 24,000-60,000

Large publisher:

Continuous monitoring (top articles checked every 15-30 minutes)
Requests per day: 5,000-15,000
Total monthly: 150,000-450,000

YouBot (You.com search):

Similar to PerplexityBot but lower volume (You.com has smaller user base).

Benchmarks: 40-60% of PerplexityBot volume.

Search Indexing Bots (Google-Extended, Bing-AI)

Google-Extended:

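Google-Extended is a robots.txt control token, not a separate crawler with its own user-agent string. It governs whether content fetched by Google's existing crawlers can be used for Gemini (formerly Bard) and related generative AI features, so it does not show up under that name in access logs.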

Frequency:

Revisit: Every 2-7 days (depends on content update frequency)
Requests: 30-50% of standard Googlebot volume

Bing-AI (Microsoft):

Powers Bing Chat, Copilot.

Frequency:

Revisit: Every 5-10 days
Requests: Lower than Google-Extended (Bing smaller market share)

Benchmark: Medium publisher sees 2,000-5,000 Google-Extended requests/month, 800-2,000 Bing-AI requests/month.

Industry-Specific Benchmarks

News Publishers

High scraping frequency. AI systems prioritize current events.

Typical pattern:

Breaking news articles: Scraped within 1-6 hours of publication
Evergreen content: Scraped every 7-14 days
Archives: Scraped quarterly (training data updates)

Major news site benchmarks:

GPTBot: 50,000-200,000 req/month
PerplexityBot: 100,000-500,000 req/month (real-time news queries)
ClaudeBot: 40,000-150,000 req/month

Small regional news:

GPTBot: 2,000-8,000 req/month
PerplexityBot: 5,000-20,000 req/month

Why higher: Breaking news, local coverage, timely information = high query volume in AI answer engines.

E-Commerce and Product Sites

Moderate frequency. Product info relatively stable.

Pattern:

New product pages: Scraped within 24-48 hours
Existing products: Revisited weekly (price/stock updates)
Reviews: Scraped when new reviews posted

Benchmarks:

Medium e-commerce (10K products):

PerplexityBot: 15,000-40,000 req/month (product queries)
GPTBot: 5,000-15,000 req/month
Answer engines dominate (users ask "best laptop under $1000", AI scrapes product pages)

Large marketplace:

PerplexityBot: 200,000-800,000 req/month
High variation based on product catalog size, review volume

Technical Documentation

Low-moderate frequency. Documentation updates infrequently.

Pattern:

Initial scrape: When documentation site launches
Revisits: Every 30-90 days (unless change detected)
Update-triggered: Scrape within days if sitemap shows new content

Benchmarks:

SaaS documentation site:

GPTBot: 1,000-5,000 req/month
PerplexityBot: 3,000-10,000 req/month (developers asking "how to" questions)

Why lower: Content stable, not time-sensitive, fewer user queries compared to news.

Academic and Research Content

Variable frequency. Depends on publication schedule.

Pattern:

New papers: Scraped within 1-2 weeks of publication
Established papers: Quarterly scraping for training updates
Preprint servers (arXiv): Higher frequency (new papers daily)

Benchmarks:

University research site:

GPTBot: 3,000-12,000 req/month
Google-Extended: 5,000-20,000 req/month (Scholar integration)

Medical journals:

Higher frequency (COVID-era pattern persists—rapid scraping of medical research for AI health queries).

PerplexityBot: 20,000-60,000 req/month

Abnormal Scraping Patterns

Detecting Excessive Frequency

When is scraping "too much"?

Red flags:

1. Frequency 5-10× above benchmark

If comparable sites see 5,000 GPTBot requests/month and you see 50,000—investigate.

Possible causes:

Your content uniquely valuable (legitimate high demand)
Crawler misconfigured (scraping same pages repeatedly)
You're being targeted for comprehensive archive scraping

2. No respect for HTTP 429 (rate limit) responses

Server returns 429 (Too Many Requests). Polite crawler backs off. Aggressive crawler ignores signal, continues hammering.

Detection:

# Check if bot received 429s and persisted
grep "GPTBot" /var/log/nginx/access.log | grep " 429 " | wc -l

If hundreds of 429 responses but requests continue → violation.
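Counting 429s shows the server pushed back, not whether the bot kept going. A rough sketch that counts requests arriving shortly after a 429, assuming nginx combined-format logs. The 30-second window and the sample lines are illustrative assumptions:

```python
import re
from datetime import datetime, timedelta

# Timestamp and status code from an nginx combined-format line
LINE = re.compile(r'\[([^\]]+)\] "[^"]*" (\d{3}) ')

def parse_line(line):
    """Return (timestamp, status), or None if the line does not match."""
    match = LINE.search(line)
    if match is None:
        return None
    when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    return when, int(match.group(2))

def requests_after_429(log_lines, bot_name, window_seconds=30):
    """Count bot requests made within window_seconds of the most recent 429 it received."""
    parsed = (parse_line(line) for line in log_lines if bot_name in line)
    events = sorted(event for event in parsed if event is not None)
    window = timedelta(seconds=window_seconds)
    last_429 = None
    count = 0
    for when, status in events:
        if last_429 is not None and when - last_429 < window:
            count += 1
        if status == 429:
            last_429 = when
    return count

def make_line(time_of_day, status):
    """Build a made-up log line for the example."""
    return (f'203.0.113.5 - - [05/Jan/2026:{time_of_day} +0000] '
            f'"GET /article HTTP/1.1" {status} 512 "-" "GPTBot/1.0"')

sample = [
    make_line("03:00:00", 200),
    make_line("03:00:01", 429),
    make_line("03:00:02", 429),  # 1 second after a 429
    make_line("03:00:05", 200),  # 3 seconds after a 429
    make_line("03:01:00", 200),  # 58 seconds after the last 429
]

print(requests_after_429(sample, "GPTBot"))  # 2
```

A non-zero count on its own proves little, since a few requests may already be in flight when the first 429 goes out. A count that keeps growing across many 429s is the pattern worth documenting.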

3. Scraping during server maintenance windows

You set maintenance mode (503 errors). Bot should back off. If scraping intensifies during 503 period → poorly configured crawler.

4. Identical content requested repeatedly

Bot requests same article 50 times in one hour.

Legitimate: CDN cache miss, crawler revalidating.

Suspicious: >10 requests to identical URL within hour suggests crawler ignoring caching.
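The repeated-request check can be run against the same logs. A sketch assuming nginx combined-format lines and fixed clock-hour buckets rather than a sliding 60-minute window, which is a simplification; the sample lines are generated for illustration:

```python
import re
from collections import Counter

# Captures the date-plus-hour bucket and the requested path
REQUEST = re.compile(
    r'\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}):\d{2}:\d{2}[^\]]*\] "(?:GET|HEAD) (\S+)'
)

def repeated_fetches(log_lines, bot_name, threshold=10):
    """Return {(hour_bucket, url): count} for URLs fetched more than threshold times in one clock hour."""
    counts = Counter()
    for line in log_lines:
        if bot_name not in line:
            continue
        match = REQUEST.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return {key: n for key, n in counts.items() if n > threshold}

def make_line(minute, path):
    """Build a made-up log line for the example."""
    return (f'203.0.113.5 - - [05/Jan/2026:03:{minute:02d}:00 +0000] '
            f'"GET {path} HTTP/1.1" 200 512 "-" "GPTBot/1.0"')

sample = [make_line(m, "/article") for m in range(12)]
sample += [make_line(m, "/other") for m in range(3)]

print(repeated_fetches(sample, "GPTBot"))
# {('05/Jan/2026:03', '/article'): 12}
```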

Rate Limit Violations

License agreements often include request quotas.

Example clause: "Licensee limited to 10,000 requests per month."

Enforcement:

# Count monthly requests from licensed bot
grep "ClaudeBot" /var/log/nginx/access.log.$(date +%Y-%m)* | wc -l

Result: 23,450 requests

Violation: 13,450 requests over quota (about 135% over the limit, or 235% of it).

Action:

Document violation (export logs, count requests)
Notify AI company (email partnership/legal contact)
Demand compliance or renegotiation (higher quota + higher fees)
Enforce penalties (if contract includes breach remedies)

Technical enforcement (automatic blocking at quota):

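# nginx limit_req only accepts per-second (r/s) or per-minute (r/m) rates, so a
# monthly quota cannot be expressed in limit_req_zone. Track requests per bot per
# month in an external store (Redis/Memcached) and use nginx for a per-minute ceiling.
# The 60r/m rate is an example value. map and limit_req_zone belong in the http block.

map $http_user_agent $claudebot_key {
    default      "";
    ~*ClaudeBot  "claudebot";
}
limit_req_zone $claudebot_key zone=claudebot:10m rate=60r/m;

location / {
    limit_req zone=claudebot burst=20;
    limit_req_status 429;  # nginx returns 503 by default
}

Once the external counter reaches the quota, return 429 to that bot for the remainder of the month.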
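A monthly quota needs a counter that persists across requests. A minimal in-memory sketch of the bookkeeping; the class name and interface are hypothetical, and a production setup would keep the counts in a shared store such as Redis so that every server sees the same totals:

```python
from collections import defaultdict
from datetime import datetime, timezone

class MonthlyQuota:
    """In-memory per-bot monthly request counter (hypothetical helper)."""

    def __init__(self, limit):
        self.limit = limit
        self.counts = defaultdict(int)  # keyed by (bot, "YYYY-MM")

    def allow(self, bot, when=None):
        """Record one request; return False once the bot has used its quota for that month."""
        when = when or datetime.now(timezone.utc)
        key = (bot, when.strftime("%Y-%m"))
        if self.counts[key] >= self.limit:
            return False  # caller should answer with 429
        self.counts[key] += 1
        return True

# Tiny limit so the cutoff is visible
quota = MonthlyQuota(limit=3)
january = datetime(2026, 1, 5, tzinfo=timezone.utc)
february = datetime(2026, 2, 1, tzinfo=timezone.utc)

january_results = [quota.allow("ClaudeBot", january) for _ in range(4)]
print(january_results)                       # [True, True, True, False]
print(quota.allow("ClaudeBot", february))    # True, counter starts fresh in a new month
```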

Burst vs. Sustained High Volume

Distinguish between:

Burst scraping: 10,000 requests in one day, then quiet for month.

Sustained scraping: 333 requests/day consistently for 30 days (same total, different pattern).

Burst is often acceptable (training cycle, quarterly archive update).

Sustained high volume might violate intent (if license covers "periodic training" but bot scrapes continuously).

Analysis:

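import statistics

# Requests per day for the last 30 days, e.g. from the daily log counts.
# Sample values: one 10,000-request day in an otherwise quiet month.
daily_counts = [10000] + [0] * 29

# Burst if the busiest day is more than 5x the daily mean
burst_pattern = max(daily_counts) > 5 * statistics.mean(daily_counts)

if burst_pattern:
    print("Burst scraping detected")
else:
    print("Sustained scraping pattern")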

Licensing implications:

Training license: Burst pattern expected.

Retrieval license (answer engines): Sustained pattern expected.

If pattern doesn't match license type → potential violation.

Frequency in Licensing Negotiations

Setting Request Quotas

Licensing deals should define frequency limits.

Model clause:

"Licensee may access up to [X] requests per calendar month. Excess usage billed at $[Y] per 1,000 additional requests."

How to set X:

Measure baseline: Current scraping volume from bot
Apply growth buffer: Increase by 50-100% to allow AI company growth
Align with content value: Premium content = tighter quotas (encourage licensing tiers)

Example:

Current GPTBot volume: 15,000 req/month

Quota options:

Conservative: 20,000/month (33% buffer)
Moderate: 30,000/month (100% buffer)
Generous: 50,000/month (233% buffer)

Overage pricing:

Base license fee: $25,000/year

Overage rate: $1 per 1,000 requests

If bot uses 35,000 requests (5,000 over 30K quota):

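Overage fee: 5 × $1 = $5 for that month ($60/year if the overage recurs every month)

Overage should be priced to discourage abuse without being punitive (unless you have made a strategic decision to restrict access heavily). At $1 per 1,000 requests against a $25,000 base fee, the overage in this example is negligible, so a rate this low does little to deter overuse.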
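The quota and overage arithmetic above can be sketched as two small functions. The names are illustrative, and billing per started block of 1,000 requests is an assumption, since the model clause does not say how partial blocks are treated:

```python
import math

def quota_with_buffer(baseline_requests, buffer_fraction):
    """Monthly quota: measured baseline plus a growth buffer (1.0 means +100%)."""
    return round(baseline_requests * (1 + buffer_fraction))

def overage_fee(used, quota, rate_per_thousand):
    """Fee for requests above quota, billed per started block of 1,000 (assumption)."""
    excess = max(0, used - quota)
    return math.ceil(excess / 1000) * rate_per_thousand

quota = quota_with_buffer(15_000, 1.0)
print(quota)                          # 30000
print(overage_fee(35_000, quota, 1))  # 5
```

Running the same function with a few candidate rates shows quickly how large the overage rate has to be before it matters next to the base fee.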

Tiered Access Models

Different frequency limits for different license tiers.

Example structure:

Basic Tier: $10,000/year

10,000 requests/month
Weekly crawl frequency
Off-peak hours only (10pm-6am)

Standard Tier: $35,000/year

50,000 requests/month
Daily crawl frequency
Any time access

Premium Tier: $100,000/year

200,000 requests/month
Hourly crawl frequency
Real-time API access (more efficient than scraping)

Benefit: AI company scales access based on need. You capture value from high-volume use cases.
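To see how the example tiers compare, it helps to look at which tier a given monthly volume needs and what each tier costs per 1,000 permitted requests. A sketch using the figures from the example structure; the function names are illustrative:

```python
# (name, monthly request cap, annual fee in dollars) from the example structure
TIERS = [
    ("Basic", 10_000, 10_000),
    ("Standard", 50_000, 35_000),
    ("Premium", 200_000, 100_000),
]

def smallest_sufficient_tier(monthly_requests):
    """Return (name, annual_fee) of the cheapest tier covering the volume, or None."""
    for name, cap, fee in TIERS:
        if monthly_requests <= cap:
            return name, fee
    return None

def annual_cost_per_thousand(cap, fee):
    """Annual fee divided by the thousands of requests the tier permits per year."""
    return fee / (cap * 12 / 1000)

print(smallest_sufficient_tier(23_450))  # ('Standard', 35000)
for name, cap, fee in TIERS:
    print(name, round(annual_cost_per_thousand(cap, fee), 2))
# Basic 83.33
# Standard 58.33
# Premium 41.67
```

The falling per-request price is the volume discount built into this structure. Whether that slope is right for a given publisher is a pricing decision the numbers alone do not settle.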

Crawl Politeness Requirements

License can mandate scraping etiquette.

Model clauses:

"Licensee shall:

a) Limit request rate to maximum [5] requests per second.

b) Respect HTTP 429 responses (back off for [30] seconds before retry).

c) Include Licensee contact info in user agent string or HTTP headers (for technical support communication).

d) Preferentially scrape during off-peak hours ([10pm-6am local time]) for non-urgent content.

e) Implement exponential backoff on 5xx server errors (do not retry immediately on server failure)."
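For reference when drafting clauses (b) and (e), this is roughly what an exponential backoff schedule looks like from the crawler's side. A sketch only: the 30-second base comes from clause (b), while the number of attempts and the 10-minute cap are assumptions:

```python
def backoff_delays(base_seconds=30, attempts=5, cap_seconds=600):
    """Wait before each retry: base, 2x base, 4x base, and so on, never above cap_seconds."""
    return [min(base_seconds * 2 ** attempt, cap_seconds) for attempt in range(attempts)]

print(backoff_delays())  # [30, 60, 120, 240, 480]
```

Spelling out the expected schedule in the contract, rather than just the phrase "exponential backoff", gives both sides something concrete to check logs against.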

Enforcement:

Monitor for violations. Document non-compliance. Escalate to legal if persistent.

Technical enforcement:

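# Limit GPTBot to 5 req/sec. Other clients get an empty key and are not limited.
# limit_req is not allowed inside "if", so the user agent is matched with map.
# (map and limit_req_zone belong in the http block.)
map $http_user_agent $gptbot_key {
    default    "";
    ~*GPTBot   "gptbot";
}
limit_req_zone $gptbot_key zone=gptbot:10m rate=5r/s;

location / {
    limit_req zone=gptbot burst=10;
    limit_req_status 429;  # nginx returns 503 by default
}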

Automatic compliance. Bot exceeding rate gets 429s, forced to slow down.

FAQ

What's considered normal AI crawler frequency for a medium-sized news site?

Medium news site (1M monthly visitors) typically sees: GPTBot 10,000-30,000 req/month (weekly visits, 1,500-4,000 req/visit), PerplexityBot 30,000-100,000 req/month (daily/hourly checks on breaking news), ClaudeBot 8,000-25,000 req/month. Total AI crawler traffic: 50,000-150,000 req/month (5-15% of total site traffic). Higher for breaking news outlets, lower for feature-focused publications. If seeing 500,000+ req/month from single bot, likely excessive unless you're top-tier national publication.

How do I know if a crawler is violating reasonable frequency limits?

Compare to benchmarks (this article's industry data) and check for red flags: (1) Request volume 5-10× the industry norm, (2) Ignores HTTP 429 rate limits, (3) Scrapes the same content repeatedly (>10 requests to one URL per hour), (4) Continues during server errors (503s), (5) Peak-hour scraping when off-peak would suffice. Measure deviation from baseline: Calculate the 30-day average and alert if daily volume exceeds it by more than 3 standard deviations. License violation: If the contract specifies a quota and the bot exceeds it, that is a violation regardless of industry norms.
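The baseline-deviation alert can be sketched in a few lines. This assumes the baseline is the previous 30 days and that today's count is compared against it, rather than being included in it; the function name and sample numbers are illustrative:

```python
import statistics

def is_anomalous(history, today, sigma=3.0):
    """True if today's count is more than sigma standard deviations above the historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample standard deviation
    return today > mean + sigma * stdev

# 30 days of fairly steady volume (made-up values)
history = [140, 150, 160] * 10

print(is_anomalous(history, 170))  # False
print(is_anomalous(history, 200))  # True
```

For bots with a bursty pattern the standard deviation is large and this test rarely fires, so it suits sustained-pattern crawlers better than quarterly ones.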

Should licensing deals include different frequency limits for training vs. retrieval bots?

Yes. Training bots (GPTBot) need periodic comprehensive scraping (weekly/monthly) = bursty pattern, moderate total volume. Answer engines (PerplexityBot) need continuous fresh data = sustained high volume, real-time access. Structure accordingly: Training license with lower overall quota but burst tolerance. Retrieval license with higher quota but rate limiting to spread load. Example: Training = 20,000 req/month (burstable to 5,000/day), Retrieval = 100,000 req/month (capped at 150/hour). Prevents training bot from real-time hammering, prevents retrieval bot from one-day archive dumps.

How often should I audit crawler frequency compliance?

Weekly monitoring for licensed crawlers (automated alerts if quota exceeded), Monthly deep review (analyze patterns, compare to license terms, identify violations), Quarterly benchmarking (compare your traffic to industry norms, adjust quotas if needed). Trigger immediate audit if: (1) Sudden traffic spike (>200% of baseline), (2) Server performance degrades, (3) New licensing deal begins (verify compliance from start). Automated monitoring prevents issues; manual audits catch sophisticated violations.

Can I require AI companies to scrape only during off-peak hours?

Legally yes (via licensing agreement), practically difficult for answer engines. Training bots can comply (scrape 2am-6am when server load low). Answer engines serve user queries 24/7—need real-time content access, can't wait for off-peak. Compromise: Require off-peak for comprehensive scraping (full-site crawls), allow anytime for targeted retrieval (specific articles referenced in user queries). Clause: "Bulk scraping operations (>1,000 pages/hour) limited to off-peak hours (10pm-6am local time). Individual article requests permitted anytime, maximum 10 requests/second." Balances publisher server capacity with AI product needs.

When Blocking AI Crawlers Isn't the Move

Skip this if:

Your site has less than 1,000 monthly organic visits. AI crawlers aren't your problem — getting indexed by traditional search is. Focus on content quality and link acquisition before worrying about bot management.
You're running a personal blog or portfolio site. AI citation of your content is free exposure at this scale. Blocking crawlers costs you visibility without protecting meaningful revenue.
Your revenue comes entirely from direct sales, not content. If your content isn't the product (e-commerce, SaaS with no content moat), AI crawlers are neutral. Your competitive advantage lives in the product, not the pages.

Frequently Asked Questions

Should I block all AI crawlers from my site?

Not necessarily. Blocking indiscriminately cuts you off from AI-powered search results and citation traffic. The better approach is selective access — allow crawlers from platforms that drive referral traffic or pay for content, block those that only scrape without attribution. Start with robots.txt analysis, then layer in more granular controls based on your traffic data.

How do I know which AI bots are crawling my site?

Check your server access logs for user-agent strings containing GPTBot, ClaudeBot, Googlebot (with AI-related query patterns), Bytespider, CCBot, and others. Most hosting platforms expose these in analytics. If you lack raw log access, tools like Cloudflare or server-side middleware can surface bot traffic patterns without custom infrastructure.

Can I monetize AI crawler access to my content?

Some publishers are negotiating licensing deals directly with AI companies. For smaller sites, the practical path is controlling access (robots.txt, rate limiting, paywalling API endpoints) and measuring whether AI-sourced citation traffic converts. The pay-per-crawl model is emerging but not standardized — position yourself by documenting your content value and traffic patterns now.
