Block PerplexityBot in robots.txt: Stop Controversial AI Crawler
Quick Summary
- What this covers: Block Perplexity's crawler using robots.txt directives. Includes controversy background, compliance verification, and server-level enforcement methods.
- Who it's for: publishers and site owners managing AI bot traffic
- Key takeaway: Read the first section for the core framework, then use the specific tactics that match your situation.
Perplexity operates PerplexityBot to crawl content for its AI-powered search product. The crawler generated substantial controversy in mid-2024 when publishers documented systematic robots.txt violations and paywall bypassing.
Wired, Forbes, and other major publishers reported PerplexityBot accessing paywalled content, ignoring explicit robots.txt blocks, and operating through undisclosed IP ranges to evade detection. Perplexity's public positioning as "respectful of publisher preferences" contradicted documented behavior.
The company subsequently improved compliance. Current PerplexityBot behavior shows better robots.txt adherence than 2024 patterns. However, the documented willingness to violate publisher directives when convenient justifies defensive blocking configurations.
This guide covers robots.txt blocking, server-level enforcement for additional protection, and monitoring to detect any compliance regression.
The PerplexityBot Controversy
2024 Documentation of robots.txt Violations
Multiple publishers documented PerplexityBot crawling despite explicit robots.txt blocks throughout 2024:
Wired investigation (June 2024):
- Implemented
User-agent: PerplexityBot/Disallow: / - Documented continued crawling from Perplexity infrastructure
- Identified undisclosed crawler using different user-agent strings
- Published technical evidence of systematic violations
Forbes reporting (Summer 2024):
- Similar robots.txt blocks ignored
- Paywall content accessed without authentication
- Perplexity responses incorporated paywalled content
- No compensation or licensing agreements in place
Publisher coalition response:
- Trade associations called for Perplexity accountability
- Legal teams evaluated copyright infringement claims
- Industry forums shared blocking strategies
- Pressure campaign for improved compliance
Perplexity's compliance improved following public scrutiny. The question facing publishers: was improvement genuine commitment or tactical response to negative attention?
Paywall Bypassing Allegations
Beyond robots.txt violations, publishers documented Perplexity accessing content behind authentication walls:
Techniques identified:
- Exploiting preview/teaser content to reconstruct full articles
- Using cached versions to bypass current paywalls
- Accessing subscriber-only content through unclear methods
- Incorporating paywalled material into AI responses without attribution
Premium publishers with strict paywalls (Financial Times, Wall Street Journal, regional newspapers) observed their paywalled content appearing in Perplexity responses without licensing agreements.
The behavior crossed from aggressive crawling into potential copyright infringement and unauthorized access to computer systems. Legal implications remain unsettled but the pattern justified publisher alarm.
Improved Compliance (But Trust Deficit Remains)
Perplexity made changes following controversy:
Technical improvements:
- Better robots.txt compliance monitoring
- More consistent user-agent identification
- Reduced reports of paywall bypassing
- Public commitment to publisher respect
Business initiatives:
- Revenue-sharing program announced
- Publisher partnerships announced
- Improved documentation and contact channels
However: Documented willingness to violate publisher directives when convenient creates lasting trust deficit. Publishers implementing defensive blocks act rationally based on demonstrated behavior, regardless of current stated commitments.
robots.txt Configuration
Basic Block Directive
Standard PerplexityBot block:
User-agent: PerplexityBot
Disallow: /
This instructs PerplexityBot to avoid all paths. Place in robots.txt at domain root.
Current compliance: Publisher reports from late 2025 onward suggest PerplexityBot now honors robots.txt directives approximately 90-95% of the time. Substantial improvement from 2024 patterns but not perfect.
Block All Perplexity Crawler Variants
Perplexity operates multiple crawlers for different purposes:
# Block main crawler
User-agent: PerplexityBot
Disallow: /
# Block any Perplexity variants
User-agent: Perplexity
Disallow: /
The second entry catches potential alternative crawlers using "Perplexity" without "Bot" suffix.
Complete Multi-Crawler Configuration
Comprehensive robots.txt managing multiple AI crawlers:
# Block Perplexity (history of violations)
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity
Disallow: /
# Block other AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: cohere-ai
Disallow: /
# Allow search engines
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: *
Disallow: /admin/
Disallow: /private/
This blocks AI crawlers with problematic track records while preserving search engine access.
Server-Level Enforcement
Why Server Blocking Matters for PerplexityBot
Perplexity's documented history of robots.txt violations justifies server-level enforcement. robots.txt establishes terms but doesn't prevent access. Server rules enforce terms by rejecting non-compliant requests.
Defense in depth philosophy: robots.txt documents your terms, server blocking enforces them. Both layers together provide maximum protection against crawlers with compliance issues.
Nginx Configuration
map $http_user_agent $block_perplexity {
default 0;
~*PerplexityBot 1;
~*Perplexity 1;
}
server {
if ($block_perplexity) {
return 403;
}
# Rest of configuration
}
Returns 403 Forbidden for PerplexityBot requests. No reliance on voluntary compliance.
Reload Nginx:
sudo nginx -t
sudo systemctl reload nginx
Apache .htaccess
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} PerplexityBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Perplexity [NC]
RewriteRule .* - [F,L]
Achieves same result via Apache rewrite rules.
Cloudflare WAF Rules
Block at CDN edge:
- Navigate to Security > WAF > Custom Rules
- Create rule:
(http.user_agent contains "PerplexityBot") or (http.user_agent contains "Perplexity") - Action: Block
- Deploy globally
Edge blocking stops requests before origin servers see them. Zero bandwidth consumed. Zero resources spent.
IP-Based Blocking
Known Perplexity IP Ranges
Perplexity operates from cloud infrastructure. Community-documented ranges:
34.82.0.0/16
35.188.0.0/16
These represent Google Cloud Platform ranges where Perplexity infrastructure operates.
Nginx IP blocking:
geo $perplexity_ip {
default 0;
34.82.0.0/16 1;
35.188.0.0/16 1;
}
server {
if ($perplexity_ip) {
return 403;
}
}
Caveat: These ranges host many services. IP-only blocking risks false positives. Use combined detection (user-agent + IP) for accuracy.
Verify Legitimate Perplexity Requests
Confirm requests claiming Perplexity identity actually originate from Perplexity:
# Reverse DNS
host 34.82.123.45
# Forward verification
dig +short <hostname>
Legitimate Perplexity requests resolve to Google Cloud infrastructure. Spoofed requests fail verification.
Monitoring and Verification
Check Server Logs for Compliance
Verify PerplexityBot honors blocks:
grep "Perplexity" /var/log/nginx/access.log | tail -50
Expected result after blocking: Zero successful (200) responses to content pages. Only 403 responses or robots.txt checks should appear.
Alert on Block Failures
Automated monitoring detects compliance issues:
#!/bin/bash
LOG="/var/log/nginx/access.log"
THRESHOLD=20
COUNT=$(grep "Perplexity" "$LOG" | grep -v "403" | wc -l)
if [ $COUNT -gt $THRESHOLD ]; then
echo "PerplexityBot block failure: $COUNT requests" | mail -s "Crawler Alert" [email protected]
fi
Run daily via cron. Alerts trigger if successful Perplexity requests exceed threshold.
Monitor for New Crawler Variants
Perplexity may introduce new user-agent strings. Stay current through:
- Publisher forums discussing new variants
- Web crawler directories
- Log analysis for unidentified bots from Perplexity IP ranges
Quarterly review of unidentified bot traffic catches new variants before they extract significant content.
Should You Block PerplexityBot?
Arguments for Blocking
Documented violations: 2024 behavior demonstrated willingness to violate robots.txt when convenient. Trust, once broken, requires time to rebuild.
No licensing: Perplexity announced revenue-sharing programs but hasn't signed large-scale licensing deals comparable to OpenAI or Anthropic. Current compensation infrastructure remains unclear.
Competitive product: Perplexity positions as search alternative. Content extracted from publishers powers product competing with publisher direct traffic.
Risk mitigation: Blocking prevents extraction while Perplexity proves sustained compliance commitment. You can unblock later if behavior warrants.
Arguments for Allowing
Improved compliance: Late 2025 reports show better robots.txt adherence. Current behavior differs from 2024 patterns.
Revenue-sharing program: Perplexity announced publisher revenue programs. Blocking precludes participation.
Search traffic potential: Perplexity positions as search engine. Allowing crawling preserves discoverability in their platform.
Second chance philosophy: All companies make mistakes. Improved behavior deserves recognition.
Recommended Approach
Conservative strategy: Block now, monitor Perplexity evolution, reconsider in 6-12 months. The documented violations justify defensive positioning. If Perplexity demonstrates sustained compliance and builds credible compensation infrastructure, blocking can be reversed.
Aggressive strategy: Permanent block until Perplexity signs direct licensing agreement. Don't reward past violations with free access.
Risk-tolerant strategy: Allow with enhanced monitoring. Implement server-level blocks ready to activate immediately if compliance failures recur.
Most publishers favor conservative approach: block until Perplexity proves sustained trustworthiness.
When Blocking AI Crawlers Isn't the Move
Skip this if:
- Your site has less than 1,000 monthly organic visits. AI crawlers aren't your problem — getting indexed by traditional search is. Focus on content quality and link acquisition before worrying about bot management.
- You're running a personal blog or portfolio site. AI citation of your content is free exposure at this scale. Blocking crawlers costs you visibility without protecting meaningful revenue.
- Your revenue comes entirely from direct sales, not content. If your content isn't the product (e-commerce, SaaS with no content moat), AI crawlers are neutral. Your competitive advantage lives in the product, not the pages.
Frequently Asked Questions
Has Perplexity stopped violating robots.txt?
Reports from late 2025 show improved compliance (90-95% adherence). Substantial improvement from 2024 documented violations. However, perfect compliance hasn't been demonstrated. Server-level enforcement provides protection against any residual or future violations.
Will blocking hurt my presence in Perplexity search results?
Perplexity positions as search alternative to Google. Blocking prevents content indexing and thus removes you from Perplexity results. Trade-off: lose potential Perplexity traffic but protect content from unauthorized extraction. Most publishers conclude protection outweighs traffic from platform with compliance issues.
Does Perplexity pay publishers for content?
Perplexity announced publisher revenue-sharing programs but details remain limited. No large-scale licensing deals comparable to News Corp-OpenAI have been announced. Compensation infrastructure exists in announced form but hasn't scaled to industry-standard licensing model.
Should I trust Perplexity's improved behavior?
Trust requires sustained demonstration over time. Perplexity's late 2025 behavior shows improvement. Whether improvement reflects genuine commitment or response to pressure remains unclear. Conservative publishers wait 12+ months of consistent compliance before removing defensive blocks.
Can Perplexity bypass robots.txt blocks?
2024 documentation showed Perplexity bypassing robots.txt through undisclosed crawlers and alternative IP ranges. Current reports suggest these techniques have been discontinued. Server-level blocking prevents bypassing regardless of Perplexity techniques — requests matching criteria are rejected before content is served.
What happens if I block Perplexity but competitors don't?
Your content doesn't appear in Perplexity responses. Competitor content does. Perplexity users see competitor information preferentially. Trade-off depends on whether you view Perplexity traffic as valuable opportunity or competitive threat. Publishers with premium content behind paywalls typically favor blocking (protecting valuable assets). Publishers with free content competing for attention might favor allowing (maximizing distribution).
How do I unblock Perplexity later if I change my mind?
Remove robots.txt directive and server-level rules. Changes take effect within 24-48 hours as Perplexity rechecks your robots.txt. Unblocking is straightforward — blocking creates no permanent consequences. Conservative publishers block first, observe Perplexity industry behavior, unblock later if warranted.
Does blocking violate any laws or regulations?
No. Publishers control access to their content. robots.txt and server-level blocking are standard web practices. You're not required to allow crawler access. Perplexity has no legal entitlement to crawl your domain. Blocking is legal, ethical, and increasingly common.