Block ByteSpider with Nginx: Stop TikTok's Aggressive AI Crawler
Quick Summary
- What this covers: Complete Nginx configuration guide to block ByteDance's ByteSpider crawler. Includes user-agent rules, IP blocking, and behavioral detection for spoofed requests.
- Who it's for: publishers and site owners managing AI bot traffic
- Key takeaway: No single mechanism stops ByteSpider. Layer user-agent rules, IP blocking, rate limiting, and behavioral detection to catch 90-95% of its traffic.
ByteSpider operates as ByteDance's web crawler, feeding training data to the Doubao large language model and TikTok AI features. The crawler stands out for three characteristics: massive volume, routine robots.txt non-compliance, and zero publisher compensation.
Publishers report ByteSpider generating 3-5x the request volume of GPTBot while respecting none of the courtesies other AI crawlers observe. It ignores robots.txt, spoofs user agents, and rotates through IP ranges to evade static blocks.
Nginx-based blocking provides the most effective defense. Configuration combines user-agent detection, IP range blocking, rate limiting, and behavioral analysis. No single mechanism stops ByteSpider entirely. Layered defenses catch 90-95% of requests.
Why Nginx-Level Blocking Matters for ByteSpider
robots.txt Alone Fails (ByteSpider Ignores Directives)
The robots.txt protocol depends on voluntary compliance. ByteSpider doesn't volunteer.
Publishers document consistent ByteSpider crawling after implementing:
```
User-agent: Bytespider
Disallow: /
```
Server logs show request volume unchanged. ByteSpider either doesn't check robots.txt or checks and proceeds anyway.
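You can verify non-compliance in your own logs by counting Bytespider requests per day before and after adding the directive. A quick sketch, assuming the default combined log format (timestamp in field 4):

```bash
# Daily Bytespider request counts
grep "Bytespider" /var/log/nginx/access.log \
  | awk '{print substr($4, 2, 11)}' \
  | sort | uniq -c
```

If the daily counts don't drop after the robots.txt change, the crawler is ignoring it.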
Compliance comparison:
| Crawler | robots.txt Compliance Rate | Publisher Reports |
|---|---|---|
| GPTBot | ~99% | Consistent compliance |
| ClaudeBot | ~99% | Consistent compliance |
| Googlebot | ~99% | Consistent compliance |
| Bytespider | ~5% | Routine violations |
The 5% ByteSpider "compliance" represents requests that stopped for other reasons (IP blocks, firewall rules) rather than robots.txt respect.
robots.txt should still be implemented for legal documentation purposes. The directive establishes that you expressly prohibited access. But enforcement requires server-level blocking.
Server-Level Enforcement Prevents Resource Consumption
Nginx blocking stops requests at the web server layer before they reach application code. Benefits:
Resource protection:
- No PHP/Python/Ruby execution
- No database queries
- No CMS page generation
- No framework overhead
Cost savings:
- Reduced bandwidth consumption
- Lower server CPU utilization
- Decreased memory usage
- Minimal log storage growth
Performance preservation:
- Legitimate user requests get full resources
- No crawler-induced slowdowns
- Shared hosting limits preserved
For sites on metered hosting or experiencing ByteSpider volumes in thousands of requests daily, server-level blocking directly reduces infrastructure costs.
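To put a number on that cost, you can sum the response bytes ByteSpider consumed from your access log. A rough sketch, again assuming the combined log format (body bytes in field 10):

```bash
# Approximate bandwidth consumed by Bytespider
grep "Bytespider" /var/log/nginx/access.log \
  | awk '{sum += $10} END {printf "%.1f MB across %d requests\n", sum / 1048576, NR}'
```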
Nginx Performance (Minimal Overhead)
Nginx processes blocks at the request handling layer with negligible performance impact. The server evaluates user-agent strings and IP ranges before any application code executes.
Performance characteristics:
- User-agent string matching: microseconds per request
- IP range checking: microseconds per request
- No database lookups required
- No external service calls
- Minimal memory allocation
Sites serving millions of requests daily can implement comprehensive ByteSpider blocking without measurable performance degradation.
Basic Nginx Configuration
User-Agent Based Blocking
The foundation of ByteSpider blocking:
```nginx
map $http_user_agent $block_bytespider {
    default 0;
    ~*Bytespider 1;
    ~*bytedance  1;
}

server {
    listen 80;
    server_name yourdomain.com;

    if ($block_bytespider) {
        return 403;
    }

    # Rest of your configuration
}
```
Explanation:
- The map directive creates the variable $block_bytespider
- Default value: 0 (don't block)
- If the user-agent contains "Bytespider" (case-insensitive): set to 1
- If the user-agent contains "bytedance" (case-insensitive): set to 1
- In the server block, return 403 Forbidden when the variable equals 1
Reload Nginx:
```bash
sudo nginx -t
sudo systemctl reload nginx
```
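After reloading, verify the rule from the command line by spoofing ByteSpider's user-agent (substitute your own domain):

```bash
# Spoofed crawler UA — should be blocked with 403
curl -I -A "Bytespider" https://yourdomain.com/

# Normal browser UA — should return 200
curl -I -A "Mozilla/5.0" https://yourdomain.com/
```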
Match Multiple ByteDance Variants
ByteSpider appears with multiple user-agent formats:

```nginx
map $http_user_agent $block_bytespider {
    default 0;
    # ~* matches case-insensitively, so one pattern covers Bytespider,
    # ByteSpider, BYTESPIDER, and any other capitalization
    ~*bytespider 1;
    ~*bytedance  1;
}
```

The ~* regex operator is case-insensitive, so separate entries for each capitalization (ByteSpider, BYTESPIDER) are redundant. What matters is covering each distinct token — "bytespider" and "bytedance" — that appears in the crawler's user-agent strings.
Return 403 vs. 444 (Connection Drop)
Two response strategies:
403 Forbidden (explicit rejection):
```nginx
if ($block_bytespider) {
    return 403;
}
```
Nginx 444 (connection drop without response):
```nginx
if ($block_bytespider) {
    return 444;
}
```
Trade-offs:
| Response | Behavior | Pros | Cons |
|---|---|---|---|
| 403 | HTTP error page | Clear signal to legitimate analysis | ByteSpider sees failure |
| 444 | Connection dropped | ByteSpider gets no response | Harder to debug issues |
Recommendation: Use 403 during initial implementation for debugging visibility. Switch to 444 after confirming block effectiveness if you want to give ByteSpider no feedback.
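The difference is visible from the client side with curl: a 403 returns a normal status line, while 444 closes the connection without any response (curl reports "Empty reply from server", exit code 52):

```bash
# With "return 403": prints 403
curl -s -o /dev/null -w "%{http_code}\n" -A "Bytespider" https://yourdomain.com/

# With "return 444": prints 000 and curl exits with an error
curl -s -o /dev/null -w "%{http_code}\n" -A "Bytespider" https://yourdomain.com/ \
  || echo "connection dropped"
```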
IP-Based Blocking
Known ByteDance IP Ranges
ByteDance operates from documented IP ranges. Block at network level:
```nginx
geo $bytedance_ip {
    default 0;

    # Known ByteDance ranges
    220.243.135.0/24 1;
    220.243.136.0/24 1;
    111.225.148.0/24 1;
    111.225.149.0/24 1;
    110.249.201.0/24 1;
    110.249.202.0/24 1;
    60.8.123.0/24    1;
    60.8.124.0/24    1;
}

server {
    if ($bytedance_ip) {
        return 403;
    }
}
```
Geo directive evaluates client IP against defined ranges. Matches set $bytedance_ip to 1.
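One caveat: behind a CDN or reverse proxy, $remote_addr holds the proxy's address, so the geo lookup never sees the crawler's real IP. The geo directive's proxy parameter marks trusted proxy addresses, for which the client IP is taken from X-Forwarded-For instead. A sketch, with a placeholder proxy address:

```nginx
geo $bytedance_ip {
    default 0;
    # Trust the proxy in front of Nginx so the client IP is read
    # from X-Forwarded-For (192.0.2.1 is a placeholder address)
    proxy 192.0.2.1;

    220.243.135.0/24 1;
    220.243.136.0/24 1;
    # Additional ranges...
}
```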
ASN-Based Blocking (AS396986, AS138294)
Block entire ByteDance Autonomous System Numbers:
Requires the third-party GeoIP2 module (packaged for most distributions) plus an ASN database that defines $geoip2_data_asn (see the sketch after the install commands):
```nginx
map $geoip2_data_asn $block_bytedance_asn {
    default 0;
    396986 1;  # ByteDance Inc.
    138294 1;  # ByteDance
}

server {
    if ($block_bytedance_asn) {
        return 403;
    }
}
```
ASN blocking catches all IP ranges associated with ByteDance, including new allocations that static IP lists miss.
Install GeoIP2 if needed:
```bash
sudo apt-get install libnginx-mod-http-geoip2
# Or
sudo yum install nginx-mod-http-geoip2
```
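The $geoip2_data_asn variable used above is not built in; it must be defined at http level from an ASN database such as MaxMind's GeoLite2-ASN (free with a MaxMind account). A sketch, assuming the database is installed at /usr/share/GeoIP/GeoLite2-ASN.mmdb:

```nginx
# http context — maps the client IP to its ASN
geoip2 /usr/share/GeoIP/GeoLite2-ASN.mmdb {
    $geoip2_data_asn autonomous_system_number;
}
```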
Combine User-Agent and IP Blocking
Maximum coverage through layered detection:
```nginx
map $http_user_agent $ua_block {
    default 0;
    ~*Bytespider 1;
    ~*bytedance  1;
}

geo $ip_block {
    default 0;
    220.243.135.0/24 1;
    220.243.136.0/24 1;
    # Additional ranges...
}

# Concatenate both flags; possible values: 00, 01, 10, 11
map "$ua_block$ip_block" $block_bytespider_combined {
    default 0;
    ~1 1;  # Block if either UA or IP matches
}

server {
    if ($block_bytespider_combined) {
        return 403;
    }
}
```
This blocks requests matching user-agent or IP range criteria. Catches spoofed user-agents (identified by IP) and IP rotation (identified by user-agent).
Advanced Behavioral Detection
Rate Limiting Suspicious Patterns
ByteSpider generates high request volumes. Rate limiting catches aggressive crawling even when user-agent and IP are spoofed:
```nginx
limit_req_zone $binary_remote_addr zone=crawler_limit:10m rate=10r/m;

server {
    location / {
        # nodelay: reject excess requests immediately instead of queueing them
        limit_req zone=crawler_limit burst=20 nodelay;
        # Rest of location config
    }
}
```
Configuration:
- Creates rate limit zone: 10 requests per minute per IP
- Burst allowance: 20 requests (accommodates legitimate traffic spikes)
- Exceeding limits returns 503 (Service Unavailable)
ByteSpider hitting hundreds of requests per minute gets throttled. Legitimate users stay under limits.
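One caution: a 10 r/m limit keyed on every client IP will also throttle crawlers you want, such as Googlebot. A common idiom exempts verified networks by giving them an empty zone key, which Nginx never counts against the limit. A sketch, with a placeholder range standing in for verified search-engine addresses:

```nginx
geo $limit_exempt {
    default 0;
    192.0.2.0/24 1;  # placeholder: verified search-engine ranges go here
}

# An empty key means the request is not rate-limited at all
map $limit_exempt $crawler_limit_key {
    0 $binary_remote_addr;
    1 "";
}

limit_req_zone $crawler_limit_key zone=crawler_limit:10m rate=10r/m;
```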
Detect Absence of Common Browser Behaviors
Real browsers load CSS, JavaScript, and images. Crawlers request only HTML:
```nginx
map $http_user_agent $suspicious_crawler {
    default 0;
    "~*Mozilla.*Windows.*Chrome" 1;  # Claims to be a browser
}

map $request_uri $static_resource {
    default 0;
    ~*\.(css|js|jpg|png|gif|ico)$ 1;
}

log_format crawler_check '$remote_addr - $http_user_agent - '
                         'Browser-claim: $suspicious_crawler - Static: $static_resource';

access_log /var/log/nginx/crawler_analysis.log crawler_check;
```
Analysis process:
- Log requests with user-agent and static resource flag
- Periodically analyze: IPs claiming browser identity but never requesting static resources
- Add confirmed spoofed IPs to block list
Weekly analysis script:
```bash
#!/bin/bash
# Find IPs claiming browser identity but only requesting HTML
awk '/Mozilla.*Chrome/ && /Static: 0/ {print $1}' /var/log/nginx/crawler_analysis.log \
  | sort | uniq -c | sort -nr | head -20
```
IPs appearing thousands of times without static resource requests are likely spoofed ByteSpider.
Challenge-Based Detection
Serve challenge pages to suspected crawlers:
```nginx
map $http_user_agent $maybe_fake_browser {
    default 0;
    ~*Mozilla 1;
}

geo $suspicious_network {
    default 0;
    220.0.0.0/8 1;  # Chinese IP ranges where ByteDance operates
}

# Nginx "if" cannot combine conditions with &&, so concatenate the flags
map "$maybe_fake_browser$suspicious_network" $serve_challenge {
    default 0;
    "11" 1;  # suspicious UA AND suspicious network
}

server {
    if ($serve_challenge) {
        rewrite ^ /challenge.html break;
    }
}
```
Challenge page (/challenge.html):
- Simple JavaScript that redirects to original URL
- Crawlers without JavaScript execution capability can't pass
- Real browsers from Chinese networks proceed normally
Trade-off: Adds friction for legitimate Chinese users. Appropriate if your audience is primarily non-Chinese or if ByteSpider volume justifies regional challenges.
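A workable implementation of that page sets a cookie via JavaScript and reloads: because the rewrite above is internal, the browser's address bar still holds the originally requested URL, so a plain reload returns the visitor to it. A sketch, assuming an arbitrary cookie name js_verified; this map replaces the earlier $serve_challenge definition so cookie holders skip the challenge:

```nginx
# Challenge only when UA and network are suspicious AND no cookie is set
# (an empty $cookie_js_verified leaves the concatenation at exactly "11")
map "$maybe_fake_browser$suspicious_network$cookie_js_verified" $serve_challenge {
    default 0;
    "11" 1;
}
```

And the page itself (/challenge.html):

```html
<!DOCTYPE html>
<html>
<head>
  <script>
    // Prove JavaScript execution, then reload the originally requested URL
    document.cookie = "js_verified=1; path=/; max-age=86400";
    location.reload();
  </script>
</head>
<body>Verifying your browser...</body>
</html>
```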
Configuration Best Practices
Centralized Configuration File
Maintain ByteSpider rules in separate include file:
/etc/nginx/conf.d/block-bytespider.conf:
```nginx
# ByteSpider user-agent blocking
map $http_user_agent $block_bytespider_ua {
    default 0;
    ~*Bytespider 1;
    ~*bytedance  1;
}

# ByteDance IP ranges
geo $block_bytespider_ip {
    default 0;
    220.243.135.0/24 1;
    220.243.136.0/24 1;
    111.225.148.0/24 1;
    111.225.149.0/24 1;
    # Additional ranges...
}

# Combined blocking variable
map "$block_bytespider_ua$block_bytespider_ip" $block_bytespider {
    default 0;
    ~1 1;
}
```
Main configuration. The include must sit at http level, where map and geo are valid; on most distributions nginx.conf already includes /etc/nginx/conf.d/*.conf there:

```nginx
# In the http block (often automatic via conf.d/*.conf)
include /etc/nginx/conf.d/block-bytespider.conf;

server {
    if ($block_bytespider) {
        return 403;
    }
    # Rest of server config
}
```
Benefits:
- Single source of truth for block rules
- Easy updates without editing main config
- Reusable across multiple server blocks
- Version control friendly
Testing Before Production Deployment
Test configuration changes before deploying:
```bash
# Test syntax
sudo nginx -t

# If successful, reload
sudo systemctl reload nginx
```
Nginx syntax checker catches configuration errors before they cause service interruption.
Staged deployment:
- Deploy to staging/development environment
- Verify ByteSpider requests are blocked
- Confirm legitimate traffic unaffected
- Deploy to production
Logging Blocked Requests
Track block effectiveness:
```nginx
map $http_user_agent $block_bytespider {
    default 0;
    ~*Bytespider 1;
    ~*bytedance  1;
}

server {
    # access_log is not valid inside a server-level "if" block,
    # so use its if= parameter for conditional logging instead
    access_log /var/log/nginx/bytespider_blocked.log combined if=$block_bytespider;

    if ($block_bytespider) {
        return 403;
    }
}
```
Dedicated log file captures all blocked ByteSpider requests. Weekly analysis reveals:
- Block success rate
- New IP ranges requiring addition to block list
- Spoofing patterns
- Volume trends
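At ByteSpider volumes the dedicated log grows quickly, so rotate it. A sketch of a logrotate policy for /etc/logrotate.d/bytespider, assuming standard Nginx paths:

```
/var/log/nginx/bytespider_blocked.log {
    weekly
    rotate 8
    compress
    missingok
    notifempty
    postrotate
        # Ask Nginx to reopen its log files
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}
```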
Monitoring and Maintenance
Weekly Log Analysis
Verify blocks remain effective:
```bash
# Count ByteSpider requests (should be low/zero)
grep "Bytespider" /var/log/nginx/access.log | wc -l

# Count blocked requests
wc -l /var/log/nginx/bytespider_blocked.log

# Identify new IP ranges
grep "Bytespider" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -nr
```
Alert thresholds:
- More than 50 ByteSpider requests per week reaching content pages: indicates block failure
- New IP ranges not in block list: requires config update
- Sudden volume spikes: investigate new crawler variants
Update IP Ranges Quarterly
ByteDance infrastructure evolves. Quarterly updates prevent block list decay:
Update process:
- Research current ByteDance IP allocations (RIPE, ARIN databases; see the query sketch after this list)
- Cross-reference with community-maintained lists
- Add new ranges to geo block
- Test configuration
- Deploy updates
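For the research step, route objects registered for ByteDance's ASNs can be pulled from the RADb routing registry with whois. A sketch; registry data varies in completeness, so cross-check against community lists:

```bash
# List prefixes registered to ByteDance ASNs in RADb
for asn in AS396986 AS138294; do
  whois -h whois.radb.net -- "-i origin $asn" | awk '/^route:/ {print $2}'
done
```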
Community resources:
- https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker
- Publisher forums discussing ByteSpider activity
- Network operator mailing lists
Alert on Block Failures
Automated alerting for unusual ByteSpider activity:
```bash
#!/bin/bash
# Alert if ByteSpider bypasses blocks
LOG="/var/log/nginx/access.log"
THRESHOLD=50

# Count Bytespider requests that did NOT receive a 403
# (combined log format: status code is field 9)
COUNT=$(grep "Bytespider" "$LOG" | awk '$9 != 403' | wc -l)

if [ "$COUNT" -gt "$THRESHOLD" ]; then
    echo "ByteSpider block failure: $COUNT successful requests detected" \
      | mail -s "Nginx Alert" [email protected]
fi
```
Run via cron daily or weekly.
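An example crontab entry for a weekly Monday-morning run (the script path is a placeholder):

```bash
# m h dom mon dow  command
0 6 * * 1 /usr/local/bin/bytespider-alert.sh
```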
Performance Impact Assessment
Nginx Block Overhead
Measure blocking impact:
Benchmark without blocks:

```bash
ab -n 1000 -c 10 https://yourdomain.com/
```

Benchmark with blocks:

```bash
ab -n 1000 -c 10 https://yourdomain.com/
```
Expected result: No measurable difference. Nginx user-agent and IP evaluation adds microseconds per request. Application-level processing dominates response time.
Real-world impact: Blocking reduces total request volume, improving overall performance by freeing resources for legitimate traffic.
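You can also time the rejection path directly by replaying blocked requests; each should return a fast 403 without touching application code. A simple curl loop (substitute your domain):

```bash
# Time 100 spoofed-UA requests — each should get an immediate 403
time for i in $(seq 1 100); do
  curl -s -o /dev/null -A "Bytespider" https://yourdomain.com/
done
```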
Monitor Server Resource Utilization
Track CPU and memory before/after implementation:
```bash
# Before
top -b -n 1 | head -20

# After
top -b -n 1 | head -20
```
Expected outcome: CPU and memory utilization decrease as ByteSpider requests are rejected before resource-intensive processing.
Complete Production Configuration Example
Full-featured ByteSpider blocking configuration:
```nginx
# /etc/nginx/conf.d/bytespider-block.conf

# User-agent detection
map $http_user_agent $block_bytespider_ua {
    default 0;
    ~*Bytespider 1;
    ~*bytedance  1;
}

# IP range blocking
geo $block_bytespider_ip {
    default 0;
    220.243.135.0/24 1;
    220.243.136.0/24 1;
    111.225.148.0/24 1;
    111.225.149.0/24 1;
    110.249.201.0/24 1;
    110.249.202.0/24 1;
    60.8.123.0/24    1;
    60.8.124.0/24    1;
}

# Combined blocking logic
map "$block_bytespider_ua$block_bytespider_ip" $block_bytespider {
    default 0;
    ~1 1;
}

# Rate limiting for aggressive crawlers
limit_req_zone $binary_remote_addr zone=bytespider_limit:10m rate=10r/m;
```
Main server config. The block file above lives in /etc/nginx/conf.d/, which most distributions include at http level automatically; do not include it inside the server block, where map, geo, and limit_req_zone are not valid:

```nginx
server {
    listen 443 ssl http2;
    server_name yourdomain.com;

    # Log blocked requests (conditional logging via if= — access_log
    # cannot appear inside a server-level "if" block)
    access_log /var/log/nginx/bytespider_blocked.log combined if=$block_bytespider;

    # Block ByteSpider
    if ($block_bytespider) {
        return 403 "Access denied";
    }

    # Apply rate limiting
    location / {
        limit_req zone=bytespider_limit burst=20 nodelay;

        # Rest of location config
        try_files $uri $uri/ =404;
    }
}
```
This configuration provides:
- User-agent blocking (catches honest ByteSpider)
- IP blocking (catches spoofed user-agents)
- Rate limiting (catches IP rotation)
- Dedicated logging (monitoring and analysis)
- Minimal performance overhead
When Blocking AI Crawlers Isn't the Move
Skip this if:
- Your site has fewer than 1,000 monthly organic visits. AI crawlers aren't your problem — getting indexed by traditional search is. Focus on content quality and link acquisition before worrying about bot management.
- You're running a personal blog or portfolio site. AI citation of your content is free exposure at this scale. Blocking crawlers costs you visibility without protecting meaningful revenue.
- Your revenue comes entirely from direct sales, not content. If your content isn't the product (e-commerce, SaaS with no content moat), AI crawlers are neutral. Your competitive advantage lives in the product, not the pages.
Frequently Asked Questions
Does Nginx blocking work if ByteSpider spoofs user-agents?
User-agent blocking alone doesn't catch spoofed requests. Layered defense combining user-agent detection and IP range blocking catches both honest ByteSpider (user-agent) and spoofed ByteSpider (IP). Add rate limiting to catch sophisticated spoofing that evades both.
Will blocking ByteSpider affect my TikTok presence?
No. ByteSpider web crawling is separate from TikTok social platform features. Blocking ByteSpider doesn't affect how your content appears in TikTok search, link previews, or social sharing. These use different infrastructure.
Can I use Nginx blocking if I'm on shared hosting?
Shared hosting typically doesn't provide Nginx configuration access. Use robots.txt (minimal effectiveness) or Cloudflare WAF rules (more effective). If you have VPS or dedicated server with Nginx, server-level blocking is optimal.
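For reference, the Cloudflare equivalent is a custom WAF rule with a Block action. A sketch using Cloudflare's rule expression language (lower() normalizes case):

```
(lower(http.user_agent) contains "bytespider") or (lower(http.user_agent) contains "bytedance")
```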
How do I verify my Nginx blocks are working?
Check server logs for ByteSpider requests: `grep "Bytespider" /var/log/nginx/access.log`. After implementing blocks, you should see only 403 responses or near-zero requests. Monitor weekly to confirm sustained effectiveness.
Should I return 403 or 444 for blocked requests?
403 Forbidden provides explicit rejection visible in logs and helpful for debugging. 444 (connection drop) gives ByteSpider no feedback. Use 403 initially for visibility, then switch to 444 after confirming blocks work if you want to provide zero feedback to the crawler.
Does blocking ByteSpider violate any regulations?
No. Publishers control access to their content. robots.txt protocol is voluntary. You're not required to allow any crawler access. Blocking ByteSpider is legal, ethical, and increasingly common among publishers tired of uncompensated extraction.
How often should I update ByteDance IP ranges?
Quarterly updates balance maintenance overhead with coverage effectiveness. ByteDance infrastructure evolves but not constantly. Set calendar reminder to review and update IP ranges every 90 days.
Can ByteSpider bypass Nginx blocks entirely?
Sophisticated crawlers can rotate IPs, spoof user-agents, and mimic browser behavior. Layered defenses (user-agent + IP + rate limiting + behavioral detection) catch 90-95% of requests. Perfect blocking is impossible against determined adversaries, but comprehensive Nginx configuration stops the vast majority of ByteSpider activity.