How to Block AI Crawlers

How do you block AI crawlers?

Blocking AI crawlers means preventing or controlling automated requests from AI training crawlers, AI search crawlers, live retrieval bots, and agentic workflows. The right action is not always a hard block. Some AI traffic may help visibility or represent a user request; other traffic may copy content, overload the origin, or ignore site rules.

The practical approach is to define policy by crawler type, then enforce it with multiple controls.

Step 1: Decide what should be allowed

Start with the business decision:

Traffic type	Typical action
Approved search crawlers	Verify and allow
AI training crawlers	Block unless approved
AI search crawlers	Allow, block, or rate-limit based on visibility goals
Live retrieval crawlers	Often allow on public pages with rate controls
Unknown AI-like traffic	Challenge or rate-limit until verified
Spoofed or evasive scrapers	Block

Deciding per traffic type prevents accidental overblocking. For example, a publisher may block training crawlers but allow live retrieval crawlers that bring citations or user-driven discovery.
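The table above can be captured as a small policy map so that every enforcement layer reads from one definition. This is a minimal sketch: the category keys, the `Action` enum, and the `action_for` helper are illustrative names, not a standard taxonomy, and where the table lists several options the sketch simply picks one default.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    RATE_LIMIT = "rate_limit"
    CHALLENGE = "challenge"
    BLOCK = "block"


# Default action per traffic type, following the table above.
# Keys are hypothetical labels chosen for this sketch.
DEFAULT_POLICY = {
    "approved_search": Action.ALLOW,       # allow once verified
    "ai_training": Action.BLOCK,           # block unless approved
    "ai_search": Action.RATE_LIMIT,        # could also be allow or block
    "live_retrieval": Action.RATE_LIMIT,   # allow on public pages, with rate controls
    "unknown_ai_like": Action.CHALLENGE,   # until verified
    "spoofed_or_evasive": Action.BLOCK,
}


def action_for(traffic_type, approved=frozenset()):
    """Return the action for a traffic type.

    `approved` holds traffic types the business has explicitly approved.
    Unrecognised types are challenged rather than silently allowed.
    """
    if traffic_type in approved:
        return Action.ALLOW
    return DEFAULT_POLICY.get(traffic_type, Action.CHALLENGE)
```

For example, `action_for("ai_training")` returns `Action.BLOCK`, while passing `approved={"ai_training"}` turns it into `Action.ALLOW`.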

Step 2: Publish crawler preferences in robots.txt

The robots.txt file is the most visible place to state crawler preferences. A simple block looks like this:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

This is useful, but it is not a security control. robots.txt is advisory. Well-behaved crawlers may follow it; malicious or evasive crawlers can ignore it.

Use robots.txt to express preference, then enforce important policies at the edge, WAF, reverse proxy, or application layer.
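Before publishing, it can help to check that the file says what you intend. The sketch below uses Python's standard `urllib.robotparser` to list which user agents the example rules disallow. The `blocked_agents` helper and the `example.com` URL are assumptions made for illustration, and the check only shows how a compliant parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that a compliant parser would keep away from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    names = ["GPTBot", "ClaudeBot", "anthropic-ai", "PerplexityBot", "CCBot", "Googlebot"]
    print(blocked_agents(ROBOTS_TXT, names))
```

Any agent without a matching group, such as Googlebot here, is treated as allowed because the file has no catch-all rule.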

Step 3: Block by user agent for basic enforcement

Server-side user-agent rules can stop simple crawlers. For example, an Nginx rule can reject known names:

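# Place inside a server or location block. ~* makes the match case-insensitive.
if ($http_user_agent ~* "(GPTBot|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider)") {
    # Reject the request before it reaches the application.
    return 403;
}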

This is easy to deploy and just as easy to bypass. A crawler can change its user-agent string, so treat user-agent blocks as a baseline control rather than the final defense.
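The same baseline check can run at the application layer when the web server is not under your control. The following is a minimal sketch as WSGI middleware. The `block_ai_user_agents` name is hypothetical and the pattern reuses the names from the Nginx rule above. It has the same weakness: a crawler that sends a different user-agent string passes straight through.

```python
import re

# Same crawler names as the Nginx rule, matched case-insensitively.
BLOCKED_UA = re.compile(
    r"GPTBot|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider",
    re.IGNORECASE,
)


def block_ai_user_agents(app):
    """Wrap a WSGI app and return 403 for blocked user agents."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if BLOCKED_UA.search(user_agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Anything else is passed to the wrapped application unchanged.
        return app(environ, start_response)

    return middleware
```

Wrapping is a single call, for example `application = block_ai_user_agents(application)`.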

Step 4: Use WAF and bot-management policy

A stronger policy combines the user-agent with request evidence:

Known crawler name
Verified or unverified source network
TLS and HTTP/2 fingerprint
Browser automation indicators
Residential proxy or datacenter origin
Request rate and route depth
Sensitive route access, such as pricing, search, catalogue, or API paths

With that evidence, the response can be more precise:

Evidence	Response
Verified crawler, low rate, approved route	Allow
Known training crawler, no approval	Block
AI crawler on expensive search/API routes	Rate-limit
Unknown crawler with browser inconsistencies	Challenge
Spoofed crawler using proxy rotation	Block
User-driven retrieval on public pages	Allow with monitoring

This avoids blanket rules that block valuable traffic and miss spoofed traffic at the same time.
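One way to read the evidence table is as an ordered set of rules, with the most suspicious combinations checked first. The sketch below is an assumption about how such a decision could be structured, not a description of any WAF product. The `RequestEvidence` fields and the `decide` function are hypothetical, and in practice the signals would come from your edge or bot-management layer.

```python
from dataclasses import dataclass


@dataclass
class RequestEvidence:
    claimed_crawler: str = ""            # crawler name found in the user-agent, if any
    source_verified: bool = False        # source network matches the claimed operator
    training_crawler: bool = False
    approved: bool = False
    expensive_route: bool = False        # search, API, and similar costly paths
    browser_inconsistencies: bool = False
    proxy_rotation: bool = False
    user_driven: bool = False            # live retrieval on behalf of a user


def decide(ev):
    """Map evidence to a response, checking the riskiest cases first."""
    # Claims a crawler identity but arrives from an unverified, rotating source.
    if ev.claimed_crawler and not ev.source_verified and ev.proxy_rotation:
        return "block"
    if ev.training_crawler and not ev.approved:
        return "block"
    if not ev.claimed_crawler and ev.browser_inconsistencies:
        return "challenge"
    if ev.expensive_route:
        return "rate_limit"
    if ev.user_driven:
        return "allow_with_monitoring"
    if ev.source_verified:
        return "allow"
    # Nothing conclusive either way: ask for more proof.
    return "challenge"
```

The rule order is a design choice. Putting the spoofing check first means a forged crawler name never reaches the allow branches.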

Step 5: Rate-limit noisy crawlers

Rate limiting is useful when a crawler has some value but is requesting too much. Good rate limits can be route-aware:

Lower limits for search, listing, catalogue, and API routes
Higher limits for static public article pages
Separate limits for verified search crawlers and unknown crawlers
Escalation from allow to slow, challenge, or block

Simple IP-based limits are not enough against residential proxy rotation. Better limits use session, fingerprint, route, and behavior signals as well as source IP.
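A route-aware limit that does not depend on source IP could look like the token-bucket sketch below. The limits in `ROUTE_LIMITS`, the route class names, and the `fingerprint` and `session_id` inputs are all assumptions for illustration. Real values depend on origin capacity and on whatever client identifiers your edge can provide.

```python
import time

# Hypothetical limits per route class: (burst capacity, tokens refilled per second).
ROUTE_LIMITS = {
    "search": (5, 0.5),
    "api": (10, 1.0),
    "article": (60, 10.0),
}


class TokenBucket:
    def __init__(self, capacity, refill_per_sec, now):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Refill for the time elapsed since the last request, up to capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RouteAwareLimiter:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.buckets = {}

    def allow(self, fingerprint, session_id, route_class):
        # The key deliberately omits source IP, so rotating proxies share one bucket.
        key = (fingerprint, session_id, route_class)
        now = self.clock()
        if key not in self.buckets:
            # Unknown route classes fall back to the strictest limit.
            capacity, refill = ROUTE_LIMITS.get(route_class, ROUTE_LIMITS["search"])
            self.buckets[key] = TokenBucket(capacity, refill, now)
        return self.buckets[key].allow(now)
```

A `False` result is the point at which a policy might escalate from allow to slow, challenge, or block.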

Step 6: Protect high-value content paths

AI crawlers often target high-value routes:

Product pages
Pricing pages
Inventory and availability endpoints
Search results
Documentation
Articles and research
Reviews and user-generated content
Public APIs

Apply stricter controls to routes where scraping creates commercial harm or origin cost. Some sites allow AI crawlers to see marketing pages while blocking product, price, and search routes.
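Route-level policy can be expressed as an ordered list of path prefixes. The sketch below is illustrative only: the prefixes in `ROUTE_RULES` are hypothetical and should be replaced with the site's real routes, and the `ai_crawler_action` helper is an assumed name.

```python
# Hypothetical path prefixes. Rules are checked in order,
# so list more specific prefixes first.
ROUTE_RULES = [
    ("/api/", "block"),
    ("/search", "block"),
    ("/pricing", "block"),
    ("/products/", "block"),
    ("/docs/", "rate_limit"),
    ("/blog/", "allow"),
]


def ai_crawler_action(path, default="rate_limit"):
    """Return the action for an AI crawler requesting `path`."""
    for prefix, action in ROUTE_RULES:
        if path.startswith(prefix):
            return action
    # Paths with no rule get the cautious default rather than a free pass.
    return default
```

With these rules, a request for `/pricing/enterprise` is blocked while `/blog/launch-post` is allowed.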

Step 7: Review the policy continuously

AI crawler names, operators, and behavior change quickly. Review:

Which AI crawlers are visiting
Whether they follow robots.txt
Which routes they request
Whether blocked traffic reappears with another user-agent
Whether allowlisted crawlers still provide value
Whether crawler traffic affects origin load or analytics

Blocking AI crawlers is not a one-time rule. It is an ongoing governance and security process.
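Part of this review can be automated from access logs. The sketch below assumes logs in the common "combined" format and counts requests per crawler and route, then lists source IPs that appeared both with a known crawler name and with some other user-agent. The `review_crawler_traffic` function and the crawler name list are assumptions for illustration. A flagged IP is only a lead to investigate, since shared networks can produce the same pattern.

```python
import re
from collections import Counter, defaultdict

# Combined log format: ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)
CRAWLER = re.compile(
    r"GPTBot|ClaudeBot|anthropic-ai|PerplexityBot|CCBot|Bytespider",
    re.IGNORECASE,
)


def review_crawler_traffic(lines):
    """Return (hits per (crawler, path), IPs seen with and without a crawler name)."""
    hits = Counter()
    agents_by_ip = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        found = CRAWLER.search(match["ua"])
        name = found.group(0).lower() if found else "other"
        agents_by_ip[match["ip"]].add(name)
        if found:
            hits[(name, match["path"])] += 1
    switched = sorted(
        ip for ip, names in agents_by_ip.items()
        if "other" in names and len(names) > 1
    )
    return hits, switched
```

`hits.most_common(10)` then gives a quick view of which crawlers request which routes most often.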

Recommended Peakhour control model

Peakhour's recommended model is:

Publish preferences with robots.txt.
Detect known AI crawler names in logs.
Verify trusted crawlers instead of trusting headers alone.
Score risk using fingerprints, proxy signals, route mix, and cadence.
Apply the least disruptive effective action: allow, rate-limit, challenge, or block.
Keep evidence attached so teams can review why a crawler was controlled.
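As an illustration of verifying a crawler rather than trusting its headers, the sketch below checks the source IP against address ranges the operator has published. It is a generic example, not Peakhour's implementation. The `PUBLISHED_RANGES` table, the `examplebot` name, and the `is_verified` helper are hypothetical, and the ranges shown are reserved documentation networks, so real ranges must come from each operator's own published list.

```python
import ipaddress

# Hypothetical operator ranges, using reserved documentation networks.
PUBLISHED_RANGES = {
    "examplebot": ["192.0.2.0/24", "2001:db8::/32"],
}


def is_verified(claimed_name, source_ip):
    """Return True if `source_ip` falls inside a range published for the claimed crawler."""
    ranges = PUBLISHED_RANGES.get(claimed_name.lower(), [])
    ip = ipaddress.ip_address(source_ip)
    # Membership is False when the address and network are different IP versions.
    return any(ip in ipaddress.ip_network(cidr) for cidr in ranges)
```

A request that claims to be `ExampleBot` but fails this check is a candidate for the spoofed-crawler treatment rather than the allowlist.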

For the detection workflow, see how to detect AI crawlers. For names to watch, see AI crawler user agents.

