title:: Apache .htaccess Bot Management: Block and Throttle AI Crawlers at the Server Level description:: Configure Apache .htaccess rules to block, throttle, or conditionally route AI crawlers like GPTBot, ClaudeBot, and Bytespider. Full mod_rewrite examples included. focus_keyword:: apache htaccess bot management category:: implementation author:: Victor Valentine Romo date:: 2026.03.20

Apache .htaccess Bot Management: Block and Throttle AI Crawlers at the Server Level

Quick Summary

What this covers: apache-htaccess-bot-management

Who it's for: publishers and site owners managing AI bot traffic

Key takeaway: Read the first section for the core framework, then use the specific tactics that match your situation.

Apache HTTP Server runs beneath roughly 30% of active websites. Its distributed configuration system — the .htaccess file — gives site operators per-directory control without touching the main server configuration. For publishers on shared hosting where the admin panel is all they get, .htaccess is the most powerful enforcement tool available.

AI crawlers don't care what server software you run. GPTBot scrapes Apache sites. ClaudeBot scrapes Apache sites. Bytespider definitely scrapes Apache sites. The crawlers send HTTP requests. Apache processes them. The only question is whether your configuration intercepts those requests before content gets served.

.htaccess sits between robots.txt (which asks nicely) and CDN-level blocking (which intercepts at the network edge). It's server-level enforcement accessible to anyone with FTP access to their document root. No SSH required. No admin privileges needed. Just a text file in the right directory with the right directives.

For Nginx-based servers, see the dedicated guide

How .htaccess Interception Works for Bot Management

Request Processing Order in Apache

When a request arrives at Apache, the server evaluates it through a defined sequence. Understanding this sequence reveals where .htaccess rules fit and why they're effective.

Connection established — TCP handshake completes
Request received — Apache reads the HTTP method, URI, and headers
Virtual host matched — Apache identifies which site configuration applies
Directory context evaluated — Apache walks the filesystem path, loading .htaccess files at each level
mod_rewrite rules evaluated — Rewrite conditions and rules execute
Access control checked — Allow/Deny directives apply
Content handler invoked — PHP, static file serving, or proxy pass

.htaccess rules execute at step 4-6, before content handlers run. A blocked AI crawler receives a 403 response without Apache ever invoking PHP, rendering templates, or querying databases. The resource savings compound: no PHP-FPM process spawned, no database connection opened, no memory allocated for page generation.

mod_rewrite vs. mod_setenvif vs. mod_authz

Apache provides multiple modules for user-agent-based filtering. Each has different characteristics.

mod_rewrite — The most flexible option. Evaluates regular expressions against any request attribute. Can redirect, block, or modify requests based on complex conditions. Most .htaccess bot management solutions use this module.

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
RewriteRule .* - [F,L]

mod_setenvif — Sets environment variables based on request attributes. Lighter weight than mod_rewrite but less flexible. Useful for flagging crawlers and applying access rules downstream.

SetEnvIfNoCase User-Agent "GPTBot" ai_crawler
SetEnvIfNoCase User-Agent "ClaudeBot" ai_crawler
Order Allow,Deny
Allow from all
Deny from env=ai_crawler

mod_authz (Apache 2.4+) — Modern access control using the Require directive. More readable syntax but requires Apache 2.4, which some shared hosts still haven't updated to.

<RequireAll>
    Require all granted
    <RequireNone>
        Require expr "%{HTTP_USER_AGENT} =~ /GPTBot/i"
        Require expr "%{HTTP_USER_AGENT} =~ /ClaudeBot/i"
    </RequireNone>
</RequireAll>

For maximum compatibility across hosting environments, mod_rewrite is the safest bet. Nearly every Apache installation supports it, and shared hosting plans universally enable it.

Blocking AI Crawlers with .htaccess

Complete Block List Configuration

Place this in the .htaccess file at your document root:

# AI Crawler Management
# Updated: 2026-02-07
# Reference: https://aipaypercrawl.com/articles/ai-crawler-directory-2026.html

<IfModule mod_rewrite.c>
RewriteEngine On

# Block AI Training Crawlers
RewriteCond %{HTTP_USER_AGENT} GPTBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ClaudeBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CCBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Google-Extended [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PerplexityBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Meta-ExternalAgent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Applebot-Extended [NC]
RewriteRule .* - [F,L]
</IfModule>

The [NC] flag makes matching case-insensitive. The [OR] flag chains conditions with logical OR — any matching condition triggers the rule. [F] returns 403 Forbidden. [L] stops further rule processing.

This block sits above any existing rewrite rules in your .htaccess. Order matters: if a prior rule matches and processes the request, later rules (including your crawler blocks) may never execute. AI crawler rules should be the first rewrite block encountered.

Selective Blocking by Content Directory

Not all content warrants the same protection level. Block AI crawlers from premium content while allowing access to commodity pages:

# Block AI crawlers from premium content directories
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot|Google-Extended) [NC]
RewriteCond %{REQUEST_URI} ^/research/ [OR]
RewriteCond %{REQUEST_URI} ^/analysis/ [OR]
RewriteCond %{REQUEST_URI} ^/premium/ [OR]
RewriteCond %{REQUEST_URI} ^/data/
RewriteRule .* - [F,L]

This configuration blocks AI crawlers from /research/, /analysis/, /premium/, and /data/ while allowing them to crawl everything else. The strategy works when you plan to monetize premium content through direct licensing while letting commodity content circulate freely as a discovery mechanism.

Custom Error Pages for Blocked Crawlers

A bare 403 response provides no information. Custom error pages communicate your licensing position:

# Custom error document for AI crawlers
ErrorDocument 403 /ai-licensing-required.html

Create ai-licensing-required.html at your document root:

<!DOCTYPE html>
<html>
<head>
    <title>Licensing Required</title>
    <meta name="robots" content="noindex">
</head>
<body>
    <h1>AI Training Access Requires a License</h1>
    <p>This domain licenses content for AI training at published rates.</p>
    <p>Review terms: <a href="/rsl.json">/rsl.json</a></p>
    <p>Contact: [email protected]</p>
</body>
</html>

The noindex meta tag prevents search engines from indexing your error page. The licensing information converts a dead-end block into a commercial opportunity. AI company engineers reviewing crawler logs see exactly how to proceed.

Throttling AI Crawlers with mod_ratelimit

Apache Rate Limiting Modules

Apache's mod_ratelimit controls bandwidth (bytes per second), not request frequency. For request-based rate limiting, mod_qos or reverse-proxy configurations provide better control.

mod_ratelimit — bandwidth throttle:

<IfModule mod_ratelimit.c>
SetEnvIfNoCase User-Agent "GPTBot" ai_crawler
SetEnvIfNoCase User-Agent "ClaudeBot" ai_crawler
SetEnvIfNoCase User-Agent "Bytespider" ai_crawler

<If "-n reqenv('ai_crawler')">
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 50
</If>
</IfModule>

This limits AI crawlers to 50 KB/s bandwidth. A typical article page weighing 100KB takes 2 seconds to deliver instead of milliseconds. The crawler still receives content, but data extraction slows dramatically. At scale, this reduces the AI company's crawl efficiency by orders of magnitude.

Connection-Based Throttling

mod_qos provides connection and request rate limiting:

<IfModule mod_qos.c>
# Limit AI crawlers to 5 concurrent connections
QS_SrvMaxConnPerIP 5
SetEnvIfNoCase User-Agent "GPTBot" QS_Cond=ai
SetEnvIfNoCase User-Agent "ClaudeBot" QS_Cond=ai
SetEnvIfNoCase User-Agent "Bytespider" QS_Cond=ai
</IfModule>

Note: mod_qos requires server-level installation, which shared hosting typically doesn't provide. If available, it offers superior control compared to .htaccess-only approaches.

Combining Throttle + Block by Crawler Type

Apply different rules to different crawlers based on their compliance history:

# Compliant crawlers (GPTBot, ClaudeBot) — throttle, don't block
SetEnvIfNoCase User-Agent "GPTBot" compliant_crawler
SetEnvIfNoCase User-Agent "ClaudeBot" compliant_crawler

# Non-compliant crawlers (Bytespider) — block entirely
RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
RewriteRule .* - [F,L]

# Compliant crawlers get bandwidth throttle
<IfModule mod_ratelimit.c>
<If "-n reqenv('compliant_crawler')">
    SetOutputFilter RATE_LIMIT
    SetEnv rate-limit 100
</If>
</IfModule>

Bytespider gets a hard 403. GPTBot and ClaudeBot receive content at reduced bandwidth. The distinction reflects reality: compliant crawlers that honor licensing terms deserve different treatment than crawlers that disregard all publisher preferences.

IP-Based Blocking for User-Agent Spoofers

Why User-Agent Blocking Isn't Sufficient

Any HTTP client can set any user-agent string. A scraping script can identify as Googlebot, Mozilla Firefox, or anything else. Bytespider has been observed sending requests with standard browser user-agent strings to bypass user-agent-based blocking.

When a crawler lies about its identity, user-agent rules fail silently. Traffic continues. Content gets scraped. Log analysis might eventually reveal the deception through behavioral patterns, but the damage accumulates in the meantime.

Blocking Known AI Company IP Ranges

IP ranges can't be spoofed at the TCP level. Responses must route back to the source address. Blocking known AI company infrastructure blocks their crawlers regardless of user-agent claims.

# Block known Bytespider (ByteDance) IP ranges
<IfModule mod_authz_core.c>
<RequireAll>
    Require all granted
    Require not ip 220.243.135.0/24
    Require not ip 220.243.136.0/24
    Require not ip 111.225.148.0/24
    Require not ip 111.225.149.0/24
</RequireAll>
</IfModule>

# Apache 2.2 fallback
<IfModule !mod_authz_core.c>
Order Allow,Deny
Allow from all
Deny from 220.243.135.0/24
Deny from 220.243.136.0/24
Deny from 111.225.148.0/24
Deny from 111.225.149.0/24
</IfModule>

The dual-module syntax ensures compatibility across Apache 2.2 and 2.4 installations. Shared hosting environments may run either version.

Maintaining and Updating IP Block Lists

AI companies rotate infrastructure. ByteDance shifts IP ranges periodically. Static block lists decay in effectiveness over time.

Maintain your IP block list in a separate included file:

# .htaccess
Include /path/to/ai-ip-blocks.conf

Note: The Include directive may not work in .htaccess on all hosts. If restricted, maintain the IP list directly in .htaccess and schedule monthly manual reviews.

Sources for updated IP ranges:

OpenAI publishes GPTBot ranges in their documentation
Anthropic publishes ClaudeBot ranges
ByteDance does not publish ranges — community-maintained lists are the primary source
The AI crawler directory aggregates known ranges

Conditional Routing for Monetization

Redirecting AI Crawlers to Licensing Endpoints

Rather than blocking, redirect AI crawlers to a licensing page:

RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Google-Extended) [NC]
RewriteCond %{REQUEST_URI} !^/ai-licensing [NC]
RewriteCond %{REQUEST_URI} !^/rsl\.json [NC]
RewriteRule ^(.*)$ /ai-licensing?requested=$1 [R=302,L]

Crawlers requesting any content page get redirected to /ai-licensing with the originally requested URL as a parameter. The exception rules ensure crawlers can still access the licensing page itself and your RSL file — blocking those would create a dead loop.

Serving Truncated Content to Unpaid Crawlers

Deliver abbreviated content to AI crawlers while serving full content to humans and search engines:

# Route AI crawlers to excerpt versions
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteCond %{REQUEST_URI} ^/articles/ [NC]
RewriteRule ^articles/(.*)$ /ai-excerpts/$1 [L]

Create an /ai-excerpts/ directory containing trimmed versions of your articles — headlines, first paragraphs, and a licensing notice. Full content remains behind the licensing wall. Crawlers receive enough to evaluate your content's worth without getting the full product.

Integrating with Cloudflare Pay-Per-Crawl

For Apache sites behind Cloudflare, .htaccess serves as the origin-level fallback. Cloudflare handles detection and billing at the edge. .htaccess catches requests that bypass the CDN — direct-to-origin requests, cached content bypasses, or CDN configuration gaps.

The recommended layering:

Cloudflare — CDN-edge AI crawler detection, Pay-Per-Crawl billing, managed rulesets
.htaccess — Origin-level blocking for requests that reach Apache directly
robots.txt — Advisory layer for compliant crawlers

Configure .htaccess to trust Cloudflare's headers:

# Trust Cloudflare's real IP header
SetEnvIf CF-Connecting-IP "." real_ip
RemoteIPHeader CF-Connecting-IP

This ensures IP-based rules evaluate the visitor's real IP address rather than Cloudflare's edge server IP.

Logging and Monitoring

Custom Log Directives for Crawler Tracking

Apache's mod_log_config supports conditional logging:

# Log AI crawler requests to a separate file
SetEnvIfNoCase User-Agent "GPTBot" ai_crawler
SetEnvIfNoCase User-Agent "ClaudeBot" ai_crawler
SetEnvIfNoCase User-Agent "Bytespider" ai_crawler
SetEnvIfNoCase User-Agent "CCBot" ai_crawler
SetEnvIfNoCase User-Agent "Google-Extended" ai_crawler
SetEnvIfNoCase User-Agent "PerplexityBot" ai_crawler

CustomLog /var/log/apache2/ai-crawlers.log combined env=ai_crawler

Note: CustomLog in .htaccess requires LogOverride to be enabled in the server configuration. Many shared hosting environments don't support this. Check with your provider or test by adding the directive and monitoring for errors.

Analyzing Blocked Request Patterns

Apache's error log records 403 responses. Filter for AI crawler blocks:

grep "403" /var/log/apache2/error.log | grep -i "gptbot\|claudebot\|bytespider"

Weekly analysis reveals trends: which crawlers attempt access most frequently, which content they target, whether new user agents appear. This data feeds pricing strategy if you shift from blocking to monetization.

Testing and Validation

Verifying Rules with curl

Test each crawler rule before deploying to production:

# Should return 403
curl -A "GPTBot/1.0" -I https://example.com/articles/test.html

# Should return 403
curl -A "ClaudeBot/1.0" -I https://example.com/articles/test.html

# Should return 200 (search engine — never block)
curl -A "Googlebot/2.1" -I https://example.com/articles/test.html

# Should return 200 (normal browser)
curl -A "Mozilla/5.0 (Windows NT 10.0)" -I https://example.com/articles/test.html

Any configuration error that blocks Googlebot will damage search rankings. Verify search engine crawlers pass through after every .htaccess modification.

Common .htaccess Errors and Fixes

500 Internal Server Error after adding rules: Check for syntax errors. A missing [NC] flag, unclosed <IfModule> block, or typo in a module name triggers 500 errors. Remove the new rules, confirm the site loads, then add rules back one block at a time.

Rules not taking effect: Verify that AllowOverride All (or at minimum AllowOverride FileInfo Options) is set in your Apache virtual host configuration. Some shared hosts restrict .htaccess capabilities. Also verify that mod_rewrite is enabled — run apache2ctl -M | grep rewrite if you have shell access.

Blocking too broadly: A regex like Bot matches Googlebot, Bingbot, and every other crawler containing "Bot" in its name. Always use specific crawler names: GPTBot, ClaudeBot, not generic patterns.

When Blocking AI Crawlers Isn't the Move

Skip this if:

Your site has less than 1,000 monthly organic visits. AI crawlers aren't your problem — getting indexed by traditional search is. Focus on content quality and link acquisition before worrying about bot management.
You're running a personal blog or portfolio site. AI citation of your content is free exposure at this scale. Blocking crawlers costs you visibility without protecting meaningful revenue.
Your revenue comes entirely from direct sales, not content. If your content isn't the product (e-commerce, SaaS with no content moat), AI crawlers are neutral. Your competitive advantage lives in the product, not the pages.

Frequently Asked Questions

Does .htaccess blocking affect site performance?

Minimally. Apache evaluates .htaccess files on every request, and complex rule sets add microseconds of processing. The net performance impact is typically positive: blocking thousands of daily AI crawler requests reduces overall server load. The CPU cycles saved by not serving full pages to crawlers far exceed the cost of evaluating rewrite conditions.

Can I use .htaccess on Nginx or LiteSpeed servers?

Nginx does not process .htaccess files. Use Nginx-specific configuration instead. LiteSpeed does process .htaccess files with mod_rewrite compatibility, so Apache rules generally work on LiteSpeed hosts. OpenLiteSpeed has partial support — test each rule after deployment.

Should I block at .htaccess level or through a CDN?

Both. CDN-level blocking (Cloudflare, Fastly, other CDNs) intercepts at the network edge, saving origin bandwidth. .htaccess catches direct-to-origin requests that bypass the CDN. The layers complement each other. Neither alone covers all scenarios.

How do I handle AI crawlers that spoof their user-agent string?

User-agent blocking catches honest crawlers. IP range blocking catches dishonest ones. Combine both in your .htaccess. Add known AI company IP ranges to deny lists. For sophisticated spoofers that rotate IPs, behavioral analysis at the CDN level (Cloudflare Bot Management) provides better detection than .htaccess rules alone.

Will blocking AI crawlers break my WordPress site?

No. WordPress functions independently of AI crawler access. The only risk: accidentally blocking Googlebot (search indexer) with an overly broad rule. Use specific AI crawler names in your patterns, never generic terms like Bot or Spider that match search engine crawlers. Test with curl after every change.