Support FAQ

How to Block AI Crawlers

How do you block AI crawlers?

Blocking AI crawlers means preventing or controlling automated requests from AI training crawlers, AI search crawlers, live retrieval bots, and agentic workflows. The right action is not always a hard block. Some AI traffic may help visibility or represent a user request; other traffic may copy content, overload the origin, or ignore site rules.

The practical approach is to define policy by crawler type, then enforce it with multiple controls.

Step 1: Decide what should be allowed

Start with the business decision:

Traffic type | Typical action
Approved search crawlers | Verify and allow
AI training crawlers | Block unless approved
AI search crawlers | Allow, block, or rate-limit based on visibility goals
Live retrieval crawlers | Often allow on public pages with rate controls
Unknown AI-like traffic | Challenge or rate-limit until verified
Spoofed or evasive scrapers | Block

This prevents accidental overblocking. For example, a publisher may block training crawlers but allow live retrieval crawlers that bring citations or user-driven discovery.

Step 2: Publish crawler preferences in robots.txt

The robots.txt file is the most visible place to state crawler preferences. A simple block looks like this:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

This is useful, but it is not a security control. robots.txt is advisory. Well-behaved crawlers may follow it; malicious or evasive crawlers can ignore it.

Use robots.txt to express preference, then enforce important policies at the edge, WAF, reverse proxy, or application layer.
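
Before publishing, it can help to confirm that the file parses the way you intend. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_allowed and the example.com URL are illustrative assumptions, and the check only shows what a compliant crawler would conclude, not what any crawler will actually do.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt example above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, crawler_token: str, url: str) -> bool:
    """Return True if a compliant crawler with this token may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(crawler_token, url)


if __name__ == "__main__":
    for token in ("GPTBot", "CCBot", "Googlebot"):
        print(token, is_allowed(ROBOTS_TXT, token, "https://example.com/articles/1"))
```

Pass the crawler's product token (for example GPTBot) rather than a full User-Agent header, because the parser matches on the token before the first slash.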

Step 3: Block by user agent for basic enforcement

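Server-side user-agent rules can stop simple crawlers. For example, an Nginx rule placed in a server or location block can reject known names:

# Case-insensitive match (~*) against the User-Agent request header.
if ($http_user_agent ~* "(GPTBot|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider)") {
    return 403;  # refuse the request outright
}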

This is easy to deploy and easy to bypass. A crawler can change its user-agent string, so user-agent blocks should be treated as a baseline control rather than the final defense.

Step 4: Use WAF and bot-management policy

A stronger policy combines the user-agent with request evidence:

  • Known crawler name
  • Verified or unverified source network
  • TLS and HTTP/2 fingerprint
  • Browser automation indicators
  • Residential proxy or datacenter origin
  • Request rate and route depth
  • Sensitive route access, such as pricing, search, catalogue, or API paths

With that evidence, the response can be more precise:

Evidence | Response
Verified crawler, low rate, approved route | Allow
Known training crawler, no approval | Block
AI crawler on expensive search/API routes | Rate-limit
Unknown crawler with browser inconsistencies | Challenge
Spoofed crawler using proxy rotation | Block
User-driven retrieval on public pages | Allow with monitoring

This avoids blanket rules that block valuable traffic and miss spoofed traffic at the same time.
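
The evidence-to-response mapping above can be sketched as a small decision function. This is an illustration only: the Evidence fields, the Action names, and the order in which rules are checked are assumptions made for the example, not a description of how any particular WAF or bot-management product evaluates requests.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    ALLOW_MONITOR = "allow with monitoring"
    RATE_LIMIT = "rate-limit"
    CHALLENGE = "challenge"
    BLOCK = "block"


@dataclass(frozen=True)
class Evidence:
    # Hypothetical fields; a real system would derive these from request signals.
    kind: str                       # "search", "training", "retrieval", or "unknown"
    verified: bool = False          # source network confirmed for the claimed crawler
    approved: bool = False          # explicitly approved by the site owner
    expensive_route: bool = False   # search, API, or similar costly path
    high_rate: bool = False         # request rate above the expected level
    browser_inconsistent: bool = False
    spoofed: bool = False           # claimed identity contradicted by evidence


def decide(e: Evidence) -> Action:
    """Check the most severe conditions first, then fall through to allow."""
    if e.spoofed:
        return Action.BLOCK
    if e.kind == "training" and not e.approved:
        return Action.BLOCK
    if e.kind == "unknown" or e.browser_inconsistent:
        return Action.CHALLENGE
    if e.expensive_route or e.high_rate:
        return Action.RATE_LIMIT
    if e.kind == "retrieval":
        return Action.ALLOW_MONITOR
    if e.verified:
        return Action.ALLOW
    # Unverified traffic that matched nothing else is challenged by default.
    return Action.CHALLENGE
```

Checking blocks before allows means a spoofed request is never allowed just because it also claims a trusted crawler name.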

Step 5: Rate-limit noisy crawlers

Rate limiting is useful when a crawler has some value but is requesting too much. Good rate limits can be route-aware:

  • Lower limits for search, listing, catalogue, and API routes
  • Higher limits for static public article pages
  • Separate limits for verified search crawlers and unknown crawlers
  • Escalation from allow to slow, challenge, or block

Simple IP-based limits are not enough against residential proxy rotation, because each request can arrive from a different address and never reach a per-IP threshold. Better limits use session, fingerprint, route, and behavior signals as well as source IP.
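
A route-aware limit keyed on something other than the source IP might look like the token-bucket sketch below. The route classes, the budgets in LIMITS, and the client_key value are all hypothetical; how a stable client key is derived from session or fingerprint signals is outside the scope of this example.

```python
import time
from dataclasses import dataclass

# Hypothetical budgets per route class: (burst capacity, tokens refilled per second).
LIMITS = {
    "search": (5, 0.5),      # expensive routes get a small budget
    "api": (10, 1.0),
    "article": (60, 10.0),   # static public pages get a larger budget
}


@dataclass
class Bucket:
    tokens: float
    updated: float


class RouteAwareLimiter:
    """Token buckets keyed by (client key, route class) instead of source IP."""

    def __init__(self, limits=LIMITS, clock=time.monotonic):
        self.limits = limits
        self.clock = clock
        self.buckets = {}

    def allow(self, client_key: str, route_class: str) -> bool:
        capacity, rate = self.limits[route_class]
        now = self.clock()
        bucket = self.buckets.setdefault(
            (client_key, route_class), Bucket(tokens=capacity, updated=now)
        )
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = now - bucket.updated
        bucket.tokens = min(capacity, bucket.tokens + elapsed * rate)
        bucket.updated = now
        if bucket.tokens >= 1:
            bucket.tokens -= 1
            return True
        return False
```

A False result does not have to mean a hard block; it can feed the escalation path described above, from slowing responses to challenging or blocking.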

Step 6: Protect high-value content paths

AI crawlers often target high-value routes:

  • Product pages
  • Pricing pages
  • Inventory and availability endpoints
  • Search results
  • Documentation
  • Articles and research
  • Reviews and user-generated content
  • Public APIs

Apply stricter controls to routes where scraping creates commercial harm or origin cost. Some sites allow AI crawlers to see marketing pages while blocking product, price, and search routes.
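
One way to express this kind of per-route policy is a prefix table consulted for requests already identified as AI crawler traffic. The prefixes and actions below are hypothetical placeholders and should be replaced with the site's real routes and decisions.

```python
from urllib.parse import urlsplit

# Hypothetical path prefixes, checked in order; the first match wins.
ROUTE_POLICY = [
    ("/api/", "block"),
    ("/search", "block"),
    ("/pricing", "block"),
    ("/products/", "rate-limit"),
    ("/docs/", "rate-limit"),
]
DEFAULT_ACTION = "allow"  # for example, marketing pages


def action_for_path(request_target: str) -> str:
    """Return the action for an AI crawler request to this path."""
    path = urlsplit(request_target).path  # ignore any query string
    for prefix, action in ROUTE_POLICY:
        if path.startswith(prefix):
            return action
    return DEFAULT_ACTION
```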

Step 7: Review the policy continuously

AI crawler names, operators, and behavior change quickly. Review:

  • Which AI crawlers are visiting
  • Whether they follow robots.txt
  • Which routes they request
  • Whether blocked traffic reappears with another user-agent
  • Whether allowlisted crawlers still provide value
  • Whether crawler traffic affects origin load or analytics

Blocking AI crawlers is not a one-time rule. It is an ongoing governance and security process.
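
Part of that review can be automated from access logs. The sketch below assumes logs in the common "combined" format and counts requests per crawler name, path, and status code; the crawler list is the one used earlier on this page and will need updating as names change. It matches on the User-Agent string only, so it will not catch crawlers that have switched to a different user-agent.

```python
import re
from collections import Counter

AI_CRAWLERS = ("GPTBot", "ClaudeBot", "anthropic-ai", "PerplexityBot", "CCBot", "Bytespider")

# Matches the request, status, size, referrer, and user-agent fields
# of a combined-format access log line.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def summarize(lines):
    """Count hits per (crawler name, path, status) for known AI crawler names."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua").lower()
        for name in AI_CRAWLERS:
            if name.lower() in user_agent:
                hits[(name, match.group("path"), match.group("status"))] += 1
                break
    return hits
```

Calling summarize on an open log file and reading hits.most_common(20) shows which crawlers are visiting, which routes they request, and whether block rules are returning the expected status codes.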

Recommended Peakhour control model

Peakhour's recommended model is:

  1. Publish preferences with robots.txt.
  2. Detect known AI crawler names in logs.
  3. Verify trusted crawlers instead of trusting headers alone.
  4. Score risk using fingerprints, proxy signals, route mix, and cadence.
  5. Apply the least disruptive effective action: allow, rate-limit, challenge, or block.
  6. Keep evidence attached so teams can review why a crawler was controlled.
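
Step 3 in the list above, verifying a crawler rather than trusting its headers, can be illustrated with a source-range check. This sketch assumes the crawler operator publishes its source IP ranges; the crawler name ExampleBot and the range shown are placeholders from the reserved documentation address space, not real crawler data, and this is not a description of Peakhour's own verification method.

```python
import ipaddress

# Placeholder data: 192.0.2.0/24 is reserved for documentation (RFC 5737).
# Replace with ranges the crawler operator actually publishes.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24"],
}


def is_verified(claimed_name: str, source_ip: str) -> bool:
    """Return True only if the source IP falls inside a range published
    for the crawler name the request claims to be."""
    ip = ipaddress.ip_address(source_ip)
    networks = PUBLISHED_RANGES.get(claimed_name, [])
    return any(ip in ipaddress.ip_network(cidr) for cidr in networks)
```

A request that claims a known crawler name but fails this check is a candidate for the spoofed-crawler handling described earlier.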

For the detection workflow, see how to detect AI crawlers. For names to watch, see AI crawler user agents.

© PEAKHOUR.IO PTY LTD 2025   ABN 76 619 930 826    All rights reserved.