Cloudflare Workers for AI Crawler Logic — Custom Bot Detection at the Edge
Quick Summary
- What this covers: Build serverless crawler detection with Cloudflare Workers. Rate limiting via KV storage, dynamic user agent blocking, and request fingerprinting without origin server load.
- Who it's for: publishers and site owners managing AI bot traffic
- Key takeaway: Read the first section for the core framework, then use the specific tactics that match your situation.
Cloudflare Workers position JavaScript logic at network egress points, intercepting HTTP requests before they traverse to origin servers. This architecture enables bot detection, rate limiting, and access control without consuming backend resources or introducing latency penalties that compound across distributed infrastructure.
For publishers negotiating AI training licenses, Workers provide enforcement granularity unavailable in robots.txt or firewall rules. You can implement dynamic allowlists, usage-based throttling, and cryptographic verification of licensed crawler access, all executing in under 5 milliseconds at Cloudflare's edge.
The operational benefit: your origin server never sees blocked requests. A site receiving 10,000 AI crawler requests per hour can shed this traffic at the edge, preserving compute capacity for legitimate users while maintaining technical leverage in licensing negotiations.
Workers Runtime Environment
Cloudflare Workers execute in a V8 isolate, the same JavaScript runtime that powers Chrome. Each request spawns an isolated execution context with access to:
- Request/Response objects — Full HTTP header and body access
- KV storage — Globally distributed key-value database with eventual consistency
- Durable Objects — Strongly consistent coordination primitives for stateful logic
- Fetch API — Ability to proxy requests to origin or external services
- Crypto API — HMAC, SHA-256, and other cryptographic functions for request signing
Execution limits constrain complexity: 50ms CPU time per request on free tier, 128MB memory, 1MB script size after compression. These boundaries favor focused logic (rate limiting, authentication checks) over heavyweight processing (full HTML parsing, machine learning inference).
For AI crawler detection, Workers excel at pattern matching, header analysis, and coordinated rate limiting across geographic regions. Computationally expensive tasks (behavioral analysis, anomaly detection) remain better suited for origin servers or third-party bot management services.
User Agent Matching — The First Filter
AI training crawlers self-identify via user agent strings when regulatory compliance or reputational considerations outweigh evasion benefits. GPTBot, ClaudeBot, Google-Extended, and CCBot broadcast their identity in headers.
A basic Worker that blocks known AI crawlers:
addEventListener('fetch', event => {
event.respondWith(handleRequest(event.request))
})
const AI_CRAWLERS = [
'GPTBot',
'ClaudeBot',
'Google-Extended',
'CCBot',
'anthropic-ai',
'cohere-ai',
'PerplexityBot',
'Bytespider', // TikTok/ByteDance
'Diffbot',
'ImagesiftBot'
]
async function handleRequest(request) {
const userAgent = request.headers.get('User-Agent') || ''
for (const bot of AI_CRAWLERS) {
if (userAgent.includes(bot)) {
return new Response('AI crawler access restricted. Contact [email protected]', {
status: 403,
headers: { 'Content-Type': 'text/plain' }
})
}
}
return fetch(request)
}
This Worker intercepts every request, checks the user agent against a crawler list, and returns HTTP 403 for matches. Unlisted traffic passes through to the origin server via fetch(request).
The licensing message in the 403 response establishes contact for negotiation. When OpenAI or Anthropic engineers encounter this block, they understand that access requires business agreement, not technical workaround.
Rate Limiting with KV Storage
User agent blocking stops cooperative crawlers. Rate limiting constrains aggressive ones that rotate user agents or ignore robots.txt directives.
Cloudflare KV provides globally replicated key-value storage with eventual consistency. Workers can increment request counters per IP address, applying throttles when thresholds exceed.
Rate-limiting Worker skeleton:
addEventListener('fetch', event => {
event.respondWith(handleRequest(event.request))
})
async function handleRequest(request) {
const ip = request.headers.get('CF-Connecting-IP')
const rateLimitKey = `ratelimit:${ip}`
// Check current request count
const requestCount = await RATE_LIMIT_KV.get(rateLimitKey)
const count = requestCount ? parseInt(requestCount) : 0
// Allow 100 requests per hour per IP
if (count >= 100) {
return new Response('Rate limit exceeded. Retry after 1 hour.', {
status: 429,
headers: {
'Retry-After': '3600',
'Content-Type': 'text/plain'
}
})
}
// Increment counter with 1-hour TTL
await RATE_LIMIT_KV.put(rateLimitKey, (count + 1).toString(), {
expirationTtl: 3600
})
return fetch(request)
}
This Worker tracks requests per IP using KV storage. When an address exceeds 100 requests per hour, it receives HTTP 429 (Too Many Requests) with a Retry-After header indicating cooldown duration.
The expirationTtl parameter auto-deletes keys after 3600 seconds, resetting counters without manual cleanup. This prevents KV storage from accumulating stale rate limit entries.
Limitations: KV storage has eventual consistency (typically under 60 seconds globally). A distributed crawler hitting multiple Cloudflare data centers simultaneously might exceed rate limits during the propagation window. For stricter enforcement, Durable Objects provide strong consistency at higher cost.
Tiered Rate Limiting by Content Value
Uniform rate limits waste monetization opportunities. Premium content (original research, expert analysis) deserves tighter controls than commodity content (news aggregation, generic tutorials).
Implement tiered limits by matching request paths:
async function handleRequest(request) {
const url = new URL(request.url)
const ip = request.headers.get('CF-Connecting-IP')
// Define content tiers
let rateLimit = 100 // Default: 100 req/hour
if (url.pathname.startsWith('/research/')) {
rateLimit = 10 // Premium tier: 10 req/hour
} else if (url.pathname.startsWith('/articles/')) {
rateLimit = 50 // Standard tier: 50 req/hour
}
const rateLimitKey = `ratelimit:${url.pathname.split('/')[1]}:${ip}`
const count = parseInt(await RATE_LIMIT_KV.get(rateLimitKey) || '0')
if (count >= rateLimit) {
return new Response(`Rate limit exceeded for ${url.pathname}`, { status: 429 })
}
await RATE_LIMIT_KV.put(rateLimitKey, (count + 1).toString(), { expirationTtl: 3600 })
return fetch(request)
}
This logic sets rate limits based on URL path structure. /research/ endpoints throttle at 10 requests per hour, /articles/ at 50, everything else at 100.
The rate limit key incorporates path segment (/research/, /articles/) alongside IP address, preventing a single IP from exhausting limits across multiple content tiers simultaneously.
When AI labs encounter differentiated rate limits, they infer content value hierarchy, informing licensing negotiations. Labs willing to pay premium rates for /research/ access can propose tiered licensing agreements matching your technical enforcement structure.
Dynamic Crawler Allowlists for Licensed Access
Once you negotiate licensing agreements, enforcement must permit authorized crawler access without manual firewall updates.
Implement token-based authentication where licensed crawlers include a shared secret in request headers:
const LICENSED_TOKENS = {
'openai-license-2026': 'sk_live_abc123xyz',
'anthropic-license-2026': 'sk_live_def456uvw'
}
async function handleRequest(request) {
const authHeader = request.headers.get('X-Crawler-License')
const userAgent = request.headers.get('User-Agent') || ''
// Check if request includes valid license token
for (const [license, token] of Object.entries(LICENSED_TOKENS)) {
if (authHeader === token) {
console.log(`Licensed access granted: ${license}`)
return fetch(request) // Allow through
}
}
// Apply standard bot blocking for unlicensed traffic
if (userAgent.includes('GPTBot') || userAgent.includes('ClaudeBot')) {
return new Response('Unlicensed crawler access denied', { status: 403 })
}
return fetch(request)
}
Licensed crawlers include X-Crawler-License: sk_live_abc123xyz in their headers. The Worker validates this token against a hardcoded list, bypassing rate limits and blocks for matches.
This approach requires coordination with AI labs during licensing negotiations. You provide the token, they configure their crawlers to include it. The token acts as a cryptographic proof of license status without requiring real-time database lookups.
Security consideration: Tokens in Workers code are visible to anyone with account access. For higher security, store tokens in Cloudflare KV or Secrets (encrypted environment variables), fetching them at runtime rather than hardcoding.
Request Fingerprinting — Detecting Evasion
Sophisticated crawlers rotate user agents and IPs to evade detection. Secondary fingerprinting signals help identify automated traffic even when primary identifiers change.
TLS fingerprinting analyzes the cipher suite order and extensions advertised during HTTPS handshake. Browsers produce consistent fingerprints; scripted crawlers often deviate.
Cloudflare exposes TLS fingerprint data via request.cf.botManagement.ja3Hash in Enterprise plans. Workers can compare this hash against known browser profiles:
async function handleRequest(request) {
const ja3Hash = request.cf?.botManagement?.ja3Hash
const userAgent = request.headers.get('User-Agent') || ''
// Known browser JA3 hashes (example)
const legitimateBrowsers = [
'abc123...', // Chrome 120
'def456...', // Firefox 121
'ghi789...' // Safari 17
]
if (ja3Hash && !legitimateBrowsers.includes(ja3Hash)) {
console.log(`Suspicious TLS fingerprint: ${ja3Hash}`)
// Apply additional scrutiny (rate limit, challenge, log)
}
return fetch(request)
}
This detects mismatches between stated user agent (Chrome) and observed TLS behavior (Python requests library). When discrepancies surface, escalate to stricter rate limits or JavaScript challenges.
Header analysis provides another signal. Legitimate browsers send 15-20 headers (Accept-Language, DNT, Sec-Fetch-*, etc.). Minimal crawler requests often include only User-Agent, Host, and Connection.
Count headers and flag sparse requests:
async function handleRequest(request) {
const headerCount = Array.from(request.headers.keys()).length
if (headerCount < 10) {
console.log(`Minimal headers detected: ${headerCount}`)
// Potential bot traffic
}
return fetch(request)
}
Combining multiple signals (user agent, TLS fingerprint, header count, rate velocity) increases detection accuracy while reducing false positives.
Logging Crawler Activity for Licensing Insights
Technical enforcement generates telemetry that informs licensing strategy. Workers can log crawler behavior to external analytics platforms, building datasets that reveal:
- Which AI labs crawl your content most aggressively
- Which content categories attract crawler attention
- Temporal patterns (weekday vs. weekend, business hours vs. off-peak)
Forward logs to Datadog, Splunk, or Elasticsearch via Worker fetch requests:
async function logCrawlerActivity(request, blocked = false) {
const logEntry = {
timestamp: new Date().toISOString(),
ip: request.headers.get('CF-Connecting-IP'),
userAgent: request.headers.get('User-Agent'),
url: request.url,
blocked: blocked,
country: request.cf?.country,
asn: request.cf?.asn
}
await fetch('https://logs.yourdomain.com/crawler-activity', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(logEntry)
})
}
async function handleRequest(request) {
const userAgent = request.headers.get('User-Agent') || ''
if (userAgent.includes('GPTBot')) {
await logCrawlerActivity(request, true)
return new Response('Access denied', { status: 403 })
}
await logCrawlerActivity(request, false)
return fetch(request)
}
This logs every request to an external endpoint, including metadata like IP, user agent, and geographic location. The blocked flag indicates whether access was granted or denied.
Aggregate these logs into dashboards that quantify crawler demand. When OpenAI representatives ask about licensing terms, present data showing their crawlers attempted 50,000 requests last month, demonstrating concrete usage value.
Content Fingerprinting for Unauthorized Detection
Licensing agreements permit crawler access. But how do you verify that content appears only in authorized datasets?
Embed cryptographic fingerprints in HTML served to crawlers. When AI model outputs later surface suspiciously similar text, fingerprints provide forensic evidence of unauthorized training.
A Worker that injects invisible fingerprints:
async function handleRequest(request) {
const userAgent = request.headers.get('User-Agent') || ''
const response = await fetch(request)
// Inject fingerprint for AI crawlers
if (userAgent.includes('GPTBot') || userAgent.includes('ClaudeBot')) {
const body = await response.text()
const fingerprint = `<!-- FP:${generateFingerprint(request)} -->`
const fingerprintedBody = body.replace('</body>', `${fingerprint}</body>`)
return new Response(fingerprintedBody, {
headers: response.headers,
status: response.status
})
}
return response
}
function generateFingerprint(request) {
const data = `${request.url}:${Date.now()}:${Math.random()}`
return btoa(data) // Base64 encode
}
This injects an HTML comment containing a unique fingerprint before the closing </body> tag, visible to crawlers but invisible in rendered pages.
If AI model outputs later include text matching your articles, search for fingerprint patterns in the model's training corpus (if accessible via audits or legal discovery). Presence of fingerprints proves unauthorized use.
Limitation: AI labs can strip HTML comments during preprocessing. More robust fingerprinting embeds patterns in word choice, sentence structure, or punctuation that survive text extraction. Workers aren't ideal for this complexity; implement on origin servers with NLP libraries.
Durable Objects for Stateful Rate Limiting
KV storage's eventual consistency creates race conditions. A crawler might exceed rate limits before KV propagates updates across regions.
Durable Objects provide strong consistency, guaranteeing that rate limit counters reflect all requests instantly. Each Durable Object is a single-threaded compute instance with transactional state.
Rate-limiting Durable Object:
export class RateLimiter {
constructor(state, env) {
this.state = state
this.requests = {}
}
async fetch(request) {
const ip = new URL(request.url).searchParams.get('ip')
const limit = 100
if (!this.requests[ip]) {
this.requests[ip] = { count: 0, resetAt: Date.now() + 3600000 }
}
const record = this.requests[ip]
if (Date.now() > record.resetAt) {
record.count = 0
record.resetAt = Date.now() + 3600000
}
if (record.count >= limit) {
return new Response('Rate limit exceeded', { status: 429 })
}
record.count++
return new Response('Allowed', { status: 200 })
}
}
Workers invoke Durable Objects via namespace bindings:
async function handleRequest(request, env) {
const ip = request.headers.get('CF-Connecting-IP')
const objectId = env.RATE_LIMITER.idFromName(ip)
const stub = env.RATE_LIMITER.get(objectId)
const rateLimitResponse = await stub.fetch(`https://dummy.com?ip=${ip}`)
if (rateLimitResponse.status === 429) {
return new Response('Rate limit exceeded', { status: 429 })
}
return fetch(request)
}
This routes rate limit checks to a Durable Object keyed by IP address. All requests from the same IP hit the same Object, guaranteeing consistent counters.
Cost: Durable Objects incur higher charges than KV storage ($0.15 per million requests vs. $0.50 per million for KV reads). Reserve for scenarios where consistency justifies expense, like high-value content with strict access controls.
Challenge Pages for Bot Verification
Pure blocking sacrifices data. Serve challenge pages that test JavaScript execution capability, distinguishing headless crawlers from browsers.
A Worker that issues JavaScript challenges:
async function handleRequest(request) {
const cookie = request.headers.get('Cookie') || ''
const challengeSolved = cookie.includes('challenge_passed=true')
if (!challengeSolved) {
const challengePage = `
<!DOCTYPE html>
<html>
<head><title>Verify Access</title></head>
<body>
<h1>Verifying your browser...</h1>
<script>
// Compute challenge token
const token = btoa('verified:' + Date.now());
document.cookie = 'challenge_passed=true; Max-Age=3600';
window.location.reload();
</script>
</body>
</html>
`
return new Response(challengePage, {
headers: { 'Content-Type': 'text/html' }
})
}
return fetch(request)
}
First-time visitors receive an HTML page containing JavaScript that sets a cookie and reloads. Browsers execute this automatically. Headless crawlers without JavaScript engines fail to set the cookie and remain stuck on the challenge page.
This gracefully degrades: legitimate users experience a 1-2 second delay on first visit, then pass through freely. AI crawlers lacking JavaScript support get blocked without you manually maintaining user agent lists.
Geographic Blocking for Training Infrastructure
AI training infrastructure concentrates in specific cloud regions. OpenAI and Anthropic operate primarily out of US data centers. Smaller labs often use cheaper European or Asian hosting.
Workers can block or rate-limit traffic from specific countries:
async function handleRequest(request) {
const country = request.cf?.country
// Block traffic from high-crawler countries
const blockedCountries = ['CN', 'RU', 'VN']
if (blockedCountries.includes(country)) {
return new Response('Access restricted in your region', { status: 403 })
}
return fetch(request)
}
This blocks requests originating from China, Russia, and Vietnam, where many low-cost crawler services operate.
Caution: Legitimate users behind VPNs or privacy services may share these country codes. Implement challenges rather than hard blocks for marginally suspicious traffic, reserving blocks for confirmed abuse sources.
Combining Workers with Origin Server Logic
Workers handle high-volume filtering at the edge. Origin servers implement sophisticated analysis that would exceed Worker CPU limits.
A layered architecture:
- Worker tier — Block obvious AI crawlers by user agent, apply rate limits, issue JavaScript challenges
- Origin tier — Analyze behavioral patterns (session duration, click patterns), inject content fingerprints, log detailed telemetry
Workers set custom headers indicating suspected bot traffic:
async function handleRequest(request) {
const headers = new Headers(request.headers)
const userAgent = request.headers.get('User-Agent') || ''
if (userAgent.includes('GPTBot')) {
headers.set('X-Suspected-Bot', 'GPTBot')
}
return fetch(new Request(request, { headers }))
}
Origin servers read the X-Suspected-Bot header and apply additional scrutiny—serving watermarked content, logging to specialized databases, or serving lower-quality excerpts rather than full articles.
This division of labor optimizes cost: Workers handle simple pattern matching cheaply, origin servers handle expensive analytics only for flagged traffic.
FAQ
Do Cloudflare Workers slow down site performance?
Execution latency is typically under 5ms. Workers run at Cloudflare's edge, closer to users than origin servers, often reducing total response time by eliminating origin round-trips for blocked requests.
Can AI labs reverse-engineer Worker logic to evade detection?
Worker code is not publicly visible. However, motivated attackers can infer logic by testing various request patterns and observing responses. Combine multiple detection signals to increase evasion cost.
How do I test Worker logic without blocking real users?
Deploy Workers to a staging domain or use Cloudflare's route filters to apply Workers only to specific paths (/api/*) during testing. Validate behavior before rolling out to production traffic.
What happens when KV storage reaches consistency after a crawler already exceeded limits?
KV's eventual consistency means short-term limit violations are possible. For strict enforcement, use Durable Objects or implement compensating logic that retrospectively blocks IPs that exceeded limits.
Can I monetize crawler access directly through Workers?
Workers can validate license tokens and enforce payment-based access tiers, but payment processing requires integration with external services (Stripe, PayPal). Workers handle authorization logic; payment platforms handle transactions.
How do I rotate license tokens without redeploying Workers?
Store tokens in Cloudflare KV or Secrets rather than hardcoding in Worker scripts. Update KV entries or secrets dynamically without code changes.
Do Workers count against Cloudflare's request limits?
Workers execute on every request passing through Cloudflare. Free tier allows 100,000 requests per day. Paid Workers ($5/month) include 10 million requests, with $0.50 per additional million.
Can Workers inspect request bodies to detect scraping patterns?
Yes, but body inspection consumes more CPU time. Use request.text() or request.json() to access body content. Be mindful of Worker CPU limits (50ms free tier, 50ms+ on paid plans).
How do I handle false positives when aggressive users resemble bots?
Implement tiered responses: first offense gets a challenge, second gets rate limiting, third gets temporary block. Store offense counts in KV with TTLs that forgive after cooldown periods.
When Blocking AI Crawlers Isn't the Move
Skip this if:
- Your site has less than 1,000 monthly organic visits. AI crawlers aren't your problem — getting indexed by traditional search is. Focus on content quality and link acquisition before worrying about bot management.
- You're running a personal blog or portfolio site. AI citation of your content is free exposure at this scale. Blocking crawlers costs you visibility without protecting meaningful revenue.
- Your revenue comes entirely from direct sales, not content. If your content isn't the product (e-commerce, SaaS with no content moat), AI crawlers are neutral. Your competitive advantage lives in the product, not the pages.
Frequently Asked Questions
Should I block all AI crawlers from my site?
Not necessarily. Blocking indiscriminately cuts you off from AI-powered search results and citation traffic. The better approach is selective access — allow crawlers from platforms that drive referral traffic or pay for content, block those that only scrape without attribution. Start with robots.txt analysis, then layer in more granular controls based on your traffic data.
How do I know which AI bots are crawling my site?
Check your server access logs for user-agent strings containing GPTBot, ClaudeBot, Googlebot (with AI-related query patterns), Bytespider, CCBot, and others. Most hosting platforms expose these in analytics. If you lack raw log access, tools like Cloudflare or server-side middleware can surface bot traffic patterns without custom infrastructure.
Can I monetize AI crawler access to my content?
Some publishers are negotiating licensing deals directly with AI companies. For smaller sites, the practical path is controlling access (robots.txt, rate limiting, paywalling API endpoints) and measuring whether AI-sourced citation traffic converts. The pay-per-crawl model is emerging but not standardized — position yourself by documenting your content value and traffic patterns now.