HAProxy AI Crawler Rate Limiting: Advanced Traffic Shaping for Bot Management
Quick Summary
- What this covers: Implement sophisticated AI crawler rate limiting with HAProxy using user-agent detection, stick tables, and dynamic rate controls. Production-ready configs included.
- Who it's for: publishers and site owners managing AI bot traffic
- Key takeaway: Read the first section for the core framework, then use the specific tactics that match your situation.
HAProxy provides sophisticated traffic control capabilities that surpass basic robots.txt directives when managing AI crawler access. While robots.txt requests compliance, HAProxy enforces limits regardless of crawler cooperation through application-layer filtering, dynamic rate tables, and multi-dimensional throttling strategies. Publishers facing aggressive AI crawler traffic, negotiating licensing agreements with usage caps, or operating under strict bandwidth budgets gain precise control through HAProxy configurations that permit desired crawler access while preventing infrastructure overload.
Why HAProxy for Crawler Management
Web servers like Nginx and Apache include rate limiting modules, but HAProxy's specialized load balancing and traffic shaping features provide advantages for complex crawler control scenarios. HAProxy operates as a reverse proxy sitting between clients and origin servers, inspecting requests before they reach application infrastructure, allowing denial or throttling before expensive backend processing occurs.
Stick tables enable stateful tracking across multiple HAProxy instances in load-balanced deployments. When the instances are linked through a peers section, stick table entries replicate between them, so a crawler hitting multiple frontend servers can still be held to shared rate limits. Rate limiting in a standalone web server typically operates per instance, allowing crawlers to bypass limits by distributing requests across servers. HAProxy with peers configured closes this loophole.
Dynamic rate adjustment based on backend health distinguishes HAProxy from simpler approaches. If health checks show origin servers failing or degraded, HAProxy rules can automatically tighten crawler rate limits to preserve capacity for human users. When backend health improves, crawler limits relax, maximizing crawler throughput within infrastructure constraints.
Multi-criteria filtering combines user-agent detection with IP-based rules, geographic origin, time-of-day patterns, and request characteristics. Block GPTBot from specific IP ranges during peak hours while allowing access at 2 AM. Permit ClaudeBot to request HTML at 20/minute but restrict PDF downloads to 5/minute. HAProxy's ACL (Access Control List) system enables sophisticated policies robots.txt cannot express.
Logging and metrics integration feeds crawler traffic data into monitoring platforms like Prometheus, Grafana, or Datadog. Track per-crawler request rates, bandwidth consumption, error rates, and throttling events in real-time dashboards. This visibility supports billing under usage-based licensing agreements and provides evidence for copyright enforcement actions.
Basic User-Agent Rate Limiting Configuration
A fundamental HAProxy configuration limiting GPTBot requests establishes the pattern for more complex policies:
frontend web_frontend
bind *:80
bind *:443 ssl crt /etc/haproxy/certs/site.pem
# Define GPTBot ACL
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
# Track GPTBot requests in stick table
stick-table type string len 64 size 100k expire 60s store http_req_rate(60s)
http-request track-sc0 req.hdr(User-Agent) if is_gptbot
# Rate limit: 10 requests per minute
http-request deny deny_status 429 if is_gptbot { sc_http_req_rate(0) gt 10 }
default_backend web_servers
backend web_servers
balance roundrobin
server web1 192.168.1.10:80 check
server web2 192.168.1.11:80 check
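This configuration identifies GPTBot via the User-Agent header, tracks requests in a stick table with 60-second expiry, and denies requests exceeding 10 per minute with HTTP 429 status. The stick table stores request rates per unique user-agent string, enabling per-crawler limits. Keys are truncated to the table's len value (64 characters here), so user-agent strings that differ only after that point share one counter.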
Key components:
- acl is_gptbot: Defines boolean condition matching GPTBot in User-Agent header (case-insensitive via -i flag)
- stick-table: Creates in-memory table tracking 100,000 unique user-agent strings with 60-second expiration
- http-request track-sc0: Increments counter for matched requests
- http-request deny: Blocks requests when tracked rate exceeds threshold (10 req/min)
- deny_status 429: Returns "Too Many Requests" status code
Testing this configuration:
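for i in {1..15}; do
# Print only the HTTP status code for each request
curl -s -o /dev/null -w '%{http_code}\n' -A "GPTBot/1.0" https://example.com/
sleep 1
done
The first 10 requests succeed; subsequent requests receive 429 responses until the 60-second window slides forward. Denied requests still count toward the tracked rate, so a crawler that keeps retrying without backing off stays blocked.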
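Tallying status codes makes the cutoff easier to see than reading individual responses. The sketch below is an illustration, not part of the original setup: tally_status is a hypothetical helper name, and the canned input stands in for real curl output so the snippet runs without a live endpoint.

```shell
# tally_status (hypothetical helper): read one HTTP status code per line
# on stdin and print "<code> <count>" pairs, sorted by code.
tally_status() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Canned status codes standing in for real curl output
printf '200\n200\n429\n200\n429\n' | tally_status
```

Against a real deployment, the output of a curl loop that prints one status code per request can be piped into the helper in the same way.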
Multi-Crawler Differentiated Rate Limits
Production environments with multiple AI crawlers require per-crawler limits reflecting licensing agreements, infrastructure capacity, and strategic priorities:
frontend web_frontend
bind *:443 ssl crt /etc/haproxy/certs/site.pem
# Define crawler ACLs
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
acl is_claudebot hdr_sub(User-Agent) -i ClaudeBot
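# Caution: Google-Extended is a robots.txt control token; Google crawls with its existing user-agent strings, so this ACL is not expected to match live requests
acl is_google_extended hdr_sub(User-Agent) -i Google-Extended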
acl is_ccbot hdr_sub(User-Agent) -i CCBot
# Stick tables for rate tracking
stick-table type string len 64 size 100k expire 60s store http_req_rate(60s),bytes_out_rate(60s)
# Track all crawlers
http-request track-sc0 req.hdr(User-Agent) if is_gptbot or is_claudebot or is_google_extended or is_ccbot
# Per-crawler rate limits (requests/minute)
http-request deny deny_status 429 if is_gptbot { sc_http_req_rate(0) gt 20 }
http-request deny deny_status 429 if is_claudebot { sc_http_req_rate(0) gt 15 }
http-request deny deny_status 429 if is_google_extended { sc_http_req_rate(0) gt 30 }
http-request deny deny_status 429 if is_ccbot { sc_http_req_rate(0) gt 10 }
# Bandwidth limits (MB/minute)
http-request deny deny_status 429 if is_gptbot { sc_bytes_out_rate(0) gt 52428800 } # 50 MB/min
http-request deny deny_status 429 if is_claudebot { sc_bytes_out_rate(0) gt 31457280 } # 30 MB/min
default_backend web_servers
This implements:
- GPTBot: 20 requests/minute, 50 MB/minute bandwidth cap
- ClaudeBot: 15 requests/minute, 30 MB/minute bandwidth cap
- Google-Extended: 30 requests/minute, no bandwidth cap
- CCBot (Common Crawl): 10 requests/minute, no bandwidth cap
Bandwidth tracking via bytes_out_rate enforces data transfer limits independent of request counts, critical when crawlers request large PDFs or media files. A crawler staying within request limits but downloading 100 MB/minute in large files still hits bandwidth throttles.
IP-Based Verification and Spoofing Prevention
Crawlers can spoof User-Agent headers, claiming to be GPTBot while originating from non-OpenAI infrastructure. HAProxy can combine user-agent filtering with IP range verification:
frontend web_frontend
# Define legitimate OpenAI IP ranges
acl openai_ips src 52.12.0.0/14 54.0.0.0/8 23.20.0.0/14
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
# HAProxy ACLs cannot reference other ACLs, so combine them in the rule conditions
# Spoofed GPTBot: correct user-agent but invalid IP; block immediately
http-request deny deny_status 403 if is_gptbot !openai_ips
# Legitimate GPTBot: correct user-agent AND valid IP; rate limit
stick-table type string len 64 size 100k expire 60s store http_req_rate(60s)
http-request track-sc0 req.hdr(User-Agent) if is_gptbot openai_ips
http-request deny deny_status 429 if is_gptbot openai_ips { sc_http_req_rate(0) gt 20 }
default_backend web_servers
The IP ranges listed above are placeholder examples, not OpenAI's actual ranges. Publishers should maintain current lists from OpenAI documentation or observed legitimate crawler IPs. Automate updates:
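#!/bin/bash
# Fetch current OpenAI IP ranges and update the HAProxy ACL file.
# The URL and the .prefixes[] JSON layout are placeholders; confirm both against OpenAI's crawler documentation.
set -euo pipefail
tmp=$(mktemp)
curl -fsS https://openai.com/crawler-ips.json | jq -r '.prefixes[]' > "$tmp"
# Keep the previous list if the download produced nothing
[ -s "$tmp" ] || { rm -f "$tmp"; exit 1; }
install -m 644 "$tmp" /etc/haproxy/openai_ips.lst && rm -f "$tmp"
# Validate the configuration (default path assumed), then reload HAProxy
haproxy -c -q -f /etc/haproxy/haproxy.cfg && systemctl reload haproxy
Scheduled daily via cron, this script keeps the ACL current as OpenAI provisions new infrastructure, provided the frontend loads the list from the file with acl openai_ips src -f /etc/haproxy/openai_ips.lst rather than from inline ranges.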
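An empty or malformed download would leave the ACL matching nothing, and every genuine GPTBot request would then be treated as spoofed. A minimal sanity check, assuming an IPv4-only list with one CIDR per line; valid_cidr_list is a hypothetical helper name, it does not range-check octets, and IPv6 prefixes would need a second pattern:

```shell
# valid_cidr_list (hypothetical helper): succeed only if stdin is non-empty
# and every line has the form a.b.c.d/prefix with a prefix length of 0-32.
valid_cidr_list() {
    awk '
        # Flag any line that is not dotted-quad IPv4 followed by /prefix
        !/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\/[0-9]+$/ { bad = 1 }
        # Count lines and flag prefix lengths above 32
        { n++; split($0, part, "/"); if (part[2] + 0 > 32) bad = 1 }
        END { exit (bad || n == 0) ? 1 : 0 }
    '
}

# Documentation-only address ranges used as sample input
printf '192.0.2.0/24\n198.51.100.0/28\n' | valid_cidr_list && echo "list looks valid"
```

Running the check on the downloaded file before it replaces the live list means a bad fetch fails the cron job instead of silently blocking a licensed crawler.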
Time-Based Rate Adjustment
Publishers often prefer allowing heavier crawler access during off-peak hours when human traffic is low. HAProxy ACLs support time-based rules:
frontend web_frontend
# Define time periods from the server's local clock (hour of day, 00-23)
acl peak_hours date,ltime(%H) -m str 08 09 10 11 12 13 14 15 16 17
acl off_peak_hours date,ltime(%H) -m str 00 01 02 03 04 05 06 07 18 19 20 21 22 23
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
stick-table type string len 64 size 100k expire 60s store http_req_rate(60s)
http-request track-sc0 req.hdr(User-Agent) if is_gptbot
# Strict limits during peak hours (8 AM - 6 PM)
http-request deny deny_status 429 if is_gptbot peak_hours { sc_http_req_rate(0) gt 5 }
# Relaxed limits during off-peak (6 PM - 8 AM)
http-request deny deny_status 429 if is_gptbot off_peak_hours { sc_http_req_rate(0) gt 30 }
default_backend web_servers
This throttles GPTBot to 5 requests/minute during daytime hours and allows 30 requests/minute at night, based on the HAProxy server's clock. Adjust thresholds based on traffic patterns; analyze logs to identify true peak periods for your audience.
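Alternative approach that stores the threshold in a transaction variable:
frontend web_frontend
# Hours 08-17 in the server's local time
acl is_business_hours date,ltime(%H) -m str 08 09 10 11 12 13 14 15 16 17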
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
# Dynamic rate limit based on time
http-request set-var(txn.rate_limit) int(5) if is_business_hours
http-request set-var(txn.rate_limit) int(30) if !is_business_hours
stick-table type string len 64 size 100k expire 60s store http_req_rate(60s)
http-request track-sc0 req.hdr(User-Agent) if is_gptbot
http-request deny deny_status 429 if is_gptbot { sc_http_req_rate(0),sub(txn.rate_limit) gt 0 }
default_backend web_servers
This sets a transaction variable (txn.rate_limit) dynamically based on time, then compares actual rate against the variable. More maintainable than hardcoding thresholds in multiple rules.
Content-Type and Path-Based Restrictions
AI training benefits from HTML content more than static resources. Publishers can allow HTML access while restricting or blocking images, CSS, JavaScript, and binaries:
frontend web_frontend
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
# Define content types by file extension
acl html_request path_end -i .html .htm
# Repeating an ACL name ORs its patterns together
acl media_request path_end -i .jpg .jpeg .png .gif .webp
acl media_request path_end -i .mp4 .avi .mov .webm
acl doc_request path_end -i .pdf .doc .docx .xls .xlsx
acl static_request path_end -i .css .js .woff .ttf
# Block static resources for crawlers
http-request deny deny_status 403 if is_gptbot static_request
# One table, three fixed keys: each content class gets its own counter
stick-table type string len 128 size 100k expire 300s store http_req_rate(300s)
# Aggressive limits for media
http-request track-sc0 str(gptbot-media) if is_gptbot media_request
http-request deny deny_status 429 if is_gptbot media_request { sc_http_req_rate(0) gt 10 }
# Moderate limits for documents
http-request track-sc1 str(gptbot-docs) if is_gptbot doc_request
http-request deny deny_status 429 if is_gptbot doc_request { sc_http_req_rate(1) gt 20 }
# Permissive limits for HTML
http-request track-sc2 str(gptbot-html) if is_gptbot html_request
http-request deny deny_status 429 if is_gptbot html_request { sc_http_req_rate(2) gt 50 }
default_backend web_servers
This configuration:
- Blocks CSS/JS/fonts completely (403 Forbidden)
- Limits images/video to 10 requests per 5 minutes
- Limits PDFs/Office docs to 20 requests per 5 minutes
- Permits HTML at 50 requests per 5 minutes
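Separate tracking counters (sc0, sc1, sc2), each bound to its own key in the stick table, track different content types independently, preventing HTML requests from consuming media quotas. Note that path_end matches only URLs that end in a file extension; pages served from extension-less paths such as /blog/post need an additional ACL.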
Path-based restrictions protect premium content sections:
frontend web_frontend
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
# Content sections
acl public_content path_beg /blog /articles /public
acl premium_content path_beg /premium /members /exclusive
acl licensed_content path_beg /licensed-archive
# Block premium content entirely
http-request deny deny_status 403 if is_gptbot premium_content
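stick-table type string len 64 size 100k expire 60s store http_req_rate(60s)
# Allow licensed content (may have separate agreement): standard rate limit, example threshold
http-request track-sc0 str(gptbot-licensed) if is_gptbot licensed_content
http-request deny deny_status 429 if is_gptbot licensed_content { sc_http_req_rate(0) gt 20 }
# Permissive public content: relaxed rate limit, example threshold
http-request track-sc1 str(gptbot-public) if is_gptbot public_content
http-request deny deny_status 429 if is_gptbot public_content { sc_http_req_rate(1) gt 50 }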
default_backend web_servers
Backend Health-Aware Dynamic Rate Limiting
HAProxy can adjust crawler limits based on backend server health, preserving capacity for human users during degraded performance:
frontend web_frontend
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
# Check backend health
acl backend_healthy nbsrv(web_servers) ge 2
acl backend_degraded nbsrv(web_servers) eq 1
acl backend_critical nbsrv(web_servers) lt 1
# Block crawlers entirely if backend critical
http-request deny deny_status 503 if is_gptbot backend_critical
# Tight limits if degraded
stick-table type string len 64 size 100k expire 60s store http_req_rate(60s)
http-request track-sc0 req.hdr(User-Agent) if is_gptbot
http-request deny deny_status 429 if is_gptbot backend_degraded { sc_http_req_rate(0) gt 5 }
# Normal limits if healthy
http-request deny deny_status 429 if is_gptbot backend_healthy { sc_http_req_rate(0) gt 20 }
default_backend web_servers
backend web_servers
balance roundrobin
option httpchk GET /health
http-check expect status 200
server web1 192.168.1.10:80 check inter 5s
server web2 192.168.1.11:80 check inter 5s
server web3 192.168.1.12:80 check inter 5s
The nbsrv() function returns the count of healthy servers in the named backend pool. The rules adjust crawler limits accordingly:
- All servers down: Return 503, block crawlers
- One server up: Limit crawlers to 5 req/min
- Two+ servers up: Allow normal 20 req/min
This prioritizes human traffic during infrastructure issues while resuming crawler access when capacity permits.
Licensing Agreement Enforcement
Publishers with usage-based licensing agreements can encode contract terms directly in HAProxy:
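# A proxy section can declare only one stick-table, so each counter lives in its own backend
backend gptbot_monthly
stick-table type string len 64 size 10k expire 30d store bytes_out_cnt
backend gptbot_rate
stick-table type string len 64 size 10k expire 60s store http_req_rate(60s)
frontend web_frontend
# OpenAI contract: 50 GB/month, max 20 req/min
acl is_gptbot hdr_sub(User-Agent) -i GPTBot
# Track cumulative bytes sent to GPTBot (64-bit counter)
http-request track-sc0 str(gptbot-monthly) table gptbot_monthly if is_gptbot
# 50 GB = 53687091200 bytes = 52428800 KB; sc_kbytes_out reports kilobytes
http-request return status 402 content-type text/plain string "Monthly transfer quota exceeded" if is_gptbot { sc_kbytes_out(0) gt 52428800 }
# Request rate limit
http-request track-sc1 str(gptbot-rate) table gptbot_rate if is_gptbot
http-request deny deny_status 429 if is_gptbot { sc_http_req_rate(1) gt 20 }
default_backend web_servers
The monthly bandwidth cap stores bytes_out_cnt, a 64-bit cumulative byte counter, in a table whose entries expire after 30 days of inactivity. Rate counters such as bytes_out_rate are 32-bit and cannot represent 50 GB, which is why a cumulative counter is used here. Because expiry is refreshed on every tracked request, an active crawler's entry never ages out on its own; reset it at the start of each billing cycle by sending clear table gptbot_monthly to the stats socket. Once GPTBot exceeds 50 GB in the cycle, requests receive HTTP 402 (Payment Required) status, signaling billing overage. The 402 response uses http-request return (HAProxy 2.2 and later) because deny_status accepts only a fixed set of status codes, which does not include 402.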
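The threshold arithmetic is worth checking before it goes into a rule, because HAProxy's rate counters are documented as 32-bit values and a monthly byte quota can exceed that range. A minimal sketch with hypothetical helper names, assuming binary units (1 GB = 1024^3 bytes):

```shell
# Binary units: 1 GB = 1024^3 bytes, so 50 GB is 53687091200 bytes
gb_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }
gb_to_kb()    { echo $(( $1 * 1024 * 1024 )); }

# Succeeds when a value fits in an unsigned 32-bit counter
fits_u32() { [ $(( $1 <= 4294967295 )) -eq 1 ]; }

quota_bytes=$(gb_to_bytes 50)
quota_kb=$(gb_to_kb 50)
echo "50 GB = $quota_bytes bytes = $quota_kb KB"

if fits_u32 "$quota_bytes"; then
    echo "byte value fits in 32 bits"
else
    echo "byte value exceeds 32 bits; compare in KB or use a 64-bit counter"
fi
```

Expressing the quota in kilobytes keeps the comparison value well inside the 32-bit range while still covering multi-gigabyte contracts.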
Logging these enforcement events provides billing documentation:
frontend web_frontend
# ... rate limiting rules ...
# Log rate limit denials. Place these rules after the track-sc rules and before the deny rules, because a denied request stops rule evaluation.
http-request capture req.hdr(User-Agent) len 128
http-request set-var(txn.rate_exceeded) str(true) if is_gptbot { sc_http_req_rate(1) gt 20 }
# http-after-response (HAProxy 2.2+) also applies to responses generated by deny
http-after-response set-header X-Rate-Limit-Exceeded true if { var(txn.rate_exceeded) -m found }
log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %[capture.req.hdr(0)] rate_exceeded=%[var(txn.rate_exceeded)]"
This custom log format includes rate limit exceeded flag, enabling post-processing scripts to generate monthly usage reports:
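#!/bin/bash
# Generate GPTBot usage report
# Assumes a five-field syslog prefix (e.g. "Feb  6 12:14:14 host haproxy[123]:"),
# which places %B (bytes sent) in field 12 of the log-format above; adjust if your prefix differs
grep 'GPTBot' /var/log/haproxy.log | \
awk '{ bytes+=$12 } END { print "Total GPTBot bandwidth: " bytes/1024/1024/1024 " GB" }'
# Count throttle events per day (month and day fields of the syslog timestamp)
grep 'rate_exceeded=true' /var/log/haproxy.log | \
awk '{ print $1, $2 }' | \
uniq -c > gptbot_throttle_events.txt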
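Fixed field numbers break when the syslog prefix changes, so a report can instead locate the byte count relative to the timers field. The following is a sketch under stated assumptions: gptbot_bytes is a hypothetical helper, the two log lines are made up and simplified, and the layout assumed is the one in the log-format above, where the status code and then %B follow the five slash-separated timers.

```shell
# gptbot_bytes (hypothetical helper): sum bytes sent for log lines mentioning GPTBot.
# The timers field (five slash-separated integers) is followed by the status
# code and then the byte count, so the byte count sits two fields after it.
gptbot_bytes() {
    awk '/GPTBot/ {
        for (i = 1; i <= NF - 2; i++) {
            if ($i ~ /^-?[0-9]+\/-?[0-9]+\/-?[0-9]+\/-?[0-9]+\/-?[0-9]+$/) {
                total += $(i + 2)
                break
            }
        }
    } END { printf "%d\n", total }'
}

# Two made-up, simplified log lines
gptbot_bytes <<'EOF'
Feb  6 12:14:14 lb1 haproxy[14389]: 203.0.113.7:33317 [06/Feb/2024:12:14:14.655] web_frontend web_servers/web1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 "GET / HTTP/1.1" GPTBot/1.0 rate_exceeded=-
Feb  6 12:14:15 lb1 haproxy[14389]: 203.0.113.7:33318 [06/Feb/2024:12:14:15.101] web_frontend web_servers/web2 8/0/2/40/50 200 1250 - - ---- 1/1/1/1/0 0/0 "GET /a HTTP/1.1" GPTBot/1.0 rate_exceeded=-
EOF
```

The loop stops at the first matching field, which is the timers field because it precedes the connection counters in this layout.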
Monitoring and Observability
HAProxy integrates with monitoring systems via stats socket, Prometheus exporter, and syslog:
global
stats socket /var/run/haproxy.sock mode 660 level admin
stats timeout 30s
# Optional TCP admin socket; the listen address was obscured in the source, so loopback is assumed here
stats socket ipv4@127.0.0.1:9101 level admin expose-fd listeners
frontend stats
mode http
bind *:8404
# Prometheus metrics endpoint (requires a build that includes the Prometheus exporter)
http-request use-service prometheus-exporter if { path /metrics }
stats enable
stats uri /stats
stats refresh 10s
Query stick table contents via stats socket:
echo "show table web_frontend" | socat stdio /var/run/haproxy.sock
This displays current entries in stick tables, showing which crawlers are being tracked and their current request rates.
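Raw table dumps are easier to scan once reduced to rate and key. A minimal sketch, assuming each entry line carries key=, use= and http_req_rate(...)= fields; table_rates is a hypothetical helper name, the sample input is made up, and the exact set of fields varies between HAProxy versions.

```shell
# table_rates (hypothetical helper): turn "show table" output into
# "<rate> <key>" lines, highest rate first. Header lines do not match and are skipped.
table_rates() {
    sed -n 's/^[^ ]*: key=\(.*\) use=[0-9]* .*http_req_rate([0-9]*)=\([0-9]*\).*$/\2 \1/p' |
        sort -rn
}

# Made-up sample in the general shape of "show table" output
table_rates <<'EOF'
# table: web_frontend, type: string, size:102400, used:2
0x55d0c0a1e2f0: key=ClaudeBot/1.0 use=0 exp=41233 http_req_rate(60000)=3
0x55d0c0a1e3a8: key=GPTBot/1.0 use=0 exp=58710 http_req_rate(60000)=12
EOF
```

In practice the socat command's output would be piped into the helper in place of the sample text.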
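Prometheus queries for crawler metrics. The built-in exporter labels series by proxy name, not by user-agent, so per-crawler series require routing each crawler to a dedicated backend (for example, a hypothetical use_backend gptbot_backend if is_gptbot rule) or deriving metrics from logs. Frontend-wide queries:
# Requests per second through the frontend
rate(haproxy_frontend_http_requests_total{proxy="web_frontend"}[5m])
# Denied requests per second (includes rate limit denials)
rate(haproxy_frontend_requests_denied_total{proxy="web_frontend"}[5m])
# Outbound bandwidth in bytes per second
rate(haproxy_frontend_bytes_out_total{proxy="web_frontend"}[1h])
Grafana dashboards visualize these metrics and can alert when traffic exceeds expected patterns or licensing agreement thresholds.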
Frequently Asked Questions
Can HAProxy differentiate crawlers if they use the same IP ranges?
Yes, via user-agent strings. HAProxy inspects HTTP headers at application layer, distinguishing GPTBot from ClaudeBot even if both originate from overlapping cloud provider IP ranges. Combine user-agent ACLs with IP validation to prevent spoofing while enabling granular per-crawler controls.
Does HAProxy rate limiting work with CDN platforms like Cloudflare?
Yes, but ensure HAProxy sees actual crawler IPs via X-Forwarded-For or CDN-specific headers. Configure HAProxy to trust CDN IP ranges and extract client IPs from forwarded headers. Rate limits then apply to original crawler IPs rather than CDN edge IPs.
How do I test HAProxy rate limiting without deploying to production?
Use HAProxy in a local Docker container with test configuration. Send requests via curl with different user-agents and rates, observing 429 responses when limits are exceeded. Verify stick table contents via stats socket to confirm tracking works as intended before promoting to production.
Can I implement graduated rate limits that slow crawlers instead of blocking them?
The rules shown so far either allow or deny a request, but the tarpit action can delay the response instead:
http-request tarpit if is_gptbot { sc_http_req_rate(0) gt 20 }
This holds the connection open for the duration set by timeout tarpit, then returns an error response (500 by default), causing the crawler to wait. However, hard denials (429 status) are clearer and allow crawlers to implement backoff.
How do HAProxy rate limits interact with robots.txt crawl-delay?
Independently. Robots.txt is advisory; crawlers can ignore Crawl-delay. HAProxy enforces limits regardless of crawler cooperation. Use both: robots.txt requests politeness, HAProxy guarantees compliance. If a crawler respects robots.txt delays, it stays within HAProxy limits naturally.
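Whether a compliant crawler stays under the enforced limit depends on the numbers: a delay of N seconds allows at most 60/N requests per minute. A small worked check, with hypothetical helper names and the 20 req/min limit used as the example; it assumes a positive integer delay.

```shell
# crawl_delay_to_rpm (hypothetical helper): maximum requests per minute a
# crawler can send if it waits <seconds> between requests (integer division).
crawl_delay_to_rpm() { echo $(( 60 / $1 )); }

# within_limit <crawl_delay_seconds> <haproxy_limit_per_minute>
within_limit() { [ "$(crawl_delay_to_rpm "$1")" -le "$2" ]; }

for delay in 1 3 6; do
    if within_limit "$delay" 20; then verdict="within"; else verdict="over"; fi
    echo "Crawl-delay $delay => $(crawl_delay_to_rpm "$delay") req/min ($verdict a 20 req/min limit)"
done
```

A Crawl-delay shorter than 60 divided by the HAProxy limit lets even a well-behaved crawler run into 429 responses, so the two values are best chosen together.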
Conclusion
HAProxy provides production-grade crawler rate limiting that far exceeds robots.txt capabilities, enabling publishers to enforce licensing agreement terms, protect infrastructure during high-traffic events, and differentiate treatment of various AI crawlers based on business relationships. Combining user-agent detection, IP validation, time-based rules, content-type filtering, and backend-health awareness creates sophisticated policies matching real-world publisher needs. Stick tables, logging, and monitoring integration deliver the observability necessary for usage-based billing and compliance verification. Publishers pursuing AI crawler monetization or managing crawlers alongside human traffic benefit from HAProxy's flexibility, especially in environments where simple robots.txt controls prove insufficient against aggressive or non-compliant crawlers.
When Blocking AI Crawlers Isn't the Move
Skip this if:
- Your site has fewer than 1,000 monthly organic visits. AI crawlers aren't your problem; getting indexed by traditional search is. Focus on content quality and link acquisition before worrying about bot management.
- You're running a personal blog or portfolio site. AI citation of your content is free exposure at this scale. Blocking crawlers costs you visibility without protecting meaningful revenue.
- Your revenue comes entirely from direct sales, not content. If your content isn't the product (e-commerce, SaaS with no content moat), AI crawlers are neutral. Your competitive advantage lives in the product, not the pages.