
Robots.txt for AI Crawlers: The Complete 2026 Guide

AI crawlers like GPTBot, ClaudeBot, and PerplexityBot are visiting your site. Your robots.txt determines whether they can index your content for AI search results. Here's how to configure it correctly in 2026.

Your robots.txt file is the front door to your website for every crawler on the internet. In 2026, a new generation of AI crawlers from OpenAI, Anthropic, Perplexity, and others is knocking on that door. Whether you let them in determines whether your content appears in AI-generated answers, citations, and search results across ChatGPT, Claude, Perplexity, and Google AI Overviews.

This guide covers every AI crawler active in 2026, the exact robots.txt syntax to allow or block each one, and complete configuration templates you can copy into your own site.

TL;DR

If you want AI search visibility, you need to allow AI crawlers in your robots.txt. If you want to protect content from training data collection, you can selectively block training-only crawlers while keeping citation crawlers active.

Quick-reference table:

| Goal | Crawlers to Allow | Crawlers to Block |
| --- | --- | --- |
| Maximum AI visibility | All AI crawlers | None |
| Citations only (no training) | ChatGPT-User, OAI-SearchBot, PerplexityBot, ClaudeBot, Applebot-Extended | GPTBot, Google-Extended, CCBot, Bytespider |
| Block all AI crawling | None | All AI crawlers |

The bottom line: Most businesses wanting AI citations should allow at minimum GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, and PerplexityBot. Blocking these crawlers makes your content invisible to the AI platforms that drive the Citation Economy.


Which AI Crawlers Exist in 2026

Twelve major AI crawlers are actively scanning the web. Each serves a different purpose, and understanding the distinction between training crawlers and citation crawlers is critical for making informed robots.txt decisions.

| Crawler | Operator | Purpose | Blocking Affects Citations? |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training data collection for GPT models | Yes -- also used for search indexing |
| ChatGPT-User | OpenAI | Real-time web browsing when users ask ChatGPT to search | Yes -- blocks live browsing results |
| OAI-SearchBot | OpenAI | SearchGPT / ChatGPT search feature indexing | Yes -- blocks SearchGPT citations |
| ClaudeBot | Anthropic | Web content indexing for Claude | Yes -- blocks Claude web citations |
| PerplexityBot | Perplexity AI | Real-time search indexing for Perplexity answers | Yes -- blocks Perplexity citations |
| Google-Extended | Google | Training data for Gemini AI models | No -- does not affect Google Search |
| GoogleOther | Google | Miscellaneous Google crawling (research, one-off tasks) | No -- separate from search indexing |
| Bytespider | ByteDance | Training data for TikTok/ByteDance AI models | No -- no public citation product |
| CCBot | Common Crawl | Open web archive used by many AI companies for training | No -- indirect training data only |
| FacebookBot | Meta | Content indexing for Meta AI features | Partially -- affects Meta AI answers |
| Applebot-Extended | Apple | Training data for Apple Intelligence features | Partially -- affects Siri and Apple AI |
| cohere-ai | Cohere | Training data for Cohere language models | No -- enterprise-focused, no consumer citations |

Understanding Each Crawler

GPTBot is OpenAI's primary web crawler. It collects content used both for model training and for powering ChatGPT's web search features. Blocking GPTBot is the single most impactful robots.txt decision you can make for AI visibility, because it affects both training and real-time citation. OpenAI respects robots.txt directives for GPTBot.

ChatGPT-User is the user-agent string sent when ChatGPT browses the web in real time during a conversation. When a user asks ChatGPT to look something up, this crawler fetches the page. Blocking it prevents ChatGPT from accessing your content during live browsing sessions.

OAI-SearchBot powers OpenAI's dedicated search product. It indexes content specifically for search queries routed through ChatGPT's search mode. Blocking it removes your content from SearchGPT results.

ClaudeBot is Anthropic's web crawler for Claude. It indexes content that Claude can reference when answering questions with web access. Anthropic has stated that ClaudeBot respects robots.txt.

PerplexityBot indexes content for Perplexity AI, the search-focused AI platform that provides cited answers to user queries. Perplexity is one of the fastest-growing AI search platforms, and its citations drive measurable referral traffic.

Google-Extended is separate from Googlebot (which powers Google Search). Blocking Google-Extended prevents your content from being used for Gemini AI training but does not affect your Google Search rankings or visibility in Google AI Overviews. This is a common point of confusion.

Bytespider is ByteDance's aggressive crawler, known for high crawl rates. It collects data for ByteDance's AI products. Many site operators block it due to high server load and limited citation benefit.

CCBot powers Common Crawl, a nonprofit that maintains an open web archive. Many AI companies use Common Crawl data for model training. Blocking CCBot reduces your content's presence in training datasets but has no direct citation impact.

FacebookBot crawls content for Meta's AI features, including Meta AI in WhatsApp, Instagram, and Facebook. Blocking it may reduce visibility in Meta's AI-generated answers.

Applebot-Extended is separate from standard Applebot (which powers Safari suggestions and Siri web results). The Extended variant specifically collects training data for Apple Intelligence. Blocking it does not affect standard Apple Search or Siri functionality.

cohere-ai crawls for Cohere, an enterprise AI company. Its impact is primarily on enterprise search products rather than consumer-facing citations.


How to Allow or Block AI Crawlers in Robots.txt

The robots.txt file lives at the root of your domain (https://yoursite.com/robots.txt). It uses a simple syntax of User-agent and Allow/Disallow directives.

Basic Syntax

To allow a specific AI crawler:

User-agent: GPTBot
Allow: /

To block a specific AI crawler:

User-agent: GPTBot
Disallow: /

To allow a crawler but block specific paths (e.g., allow your blog but block private pages):

User-agent: GPTBot
Allow: /blog/
Allow: /press-releases/
Disallow: /admin/
Disallow: /account/
Disallow: /api/

Important Rules

  1. Each crawler name goes on its own User-agent line. You cannot list several crawlers on a single User-agent line, though you can stack multiple User-agent lines above one shared set of rules.
  2. More specific paths take precedence. If you Disallow: / but Allow: /blog/, major crawlers apply the longest matching rule, so /blog/ is accessible and everything else is blocked.
  3. Robots.txt is a suggestion, not enforcement. Well-behaved crawlers (GPTBot, ClaudeBot, PerplexityBot) respect it; malicious crawlers ignore it.
  4. Changes take effect on the next crawl. There is no instant cache invalidation: crawlers typically cache robots.txt for up to a day or so, and it can take longer for your pages to be re-crawled under the new rules.
  5. The wildcard * matches any crawler without its own named group. A User-agent: * group with Disallow: / blocks everything, including AI crawlers, unless you add separate groups for the named crawlers you want to permit -- compliant crawlers follow the most specific matching User-agent group wherever it appears in the file. The sketch after this list shows how these rules are evaluated in practice.
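
Here is a minimal sketch of that evaluation using Python's standard-library robots.txt parser. The rules and URLs are illustrative, and the stdlib parser is only an approximation of how each crawler interprets a file (it applies the first matching rule within a group, which is why the Allow line is listed before the broader Disallow):

from urllib.robotparser import RobotFileParser

# Illustrative rules: GPTBot may crawl the blog only; everyone else is blocked.
rules = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /

User-agent: *
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# GPTBot matches its own group: the blog is allowed, everything else is not.
print(parser.can_fetch("GPTBot", "https://yoursite.com/blog/post"))        # True
print(parser.can_fetch("GPTBot", "https://yoursite.com/pricing"))          # False

# A crawler with no named group falls through to the wildcard group.
print(parser.can_fetch("SomeOtherBot", "https://yoursite.com/blog/post"))  # False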

Testing Your Configuration

First, confirm that your server returns robots.txt when requested with AI crawler user-agent strings (this catches CDN or firewall rules that block bot traffic outright):

curl -A "GPTBot/1.0" https://yoursite.com/robots.txt
curl -A "ClaudeBot/1.0" https://yoursite.com/robots.txt
curl -A "PerplexityBot/1.0" https://yoursite.com/robots.txt

Use Pressonify's AI Visibility Checker to test your robots.txt against all major AI crawlers simultaneously.
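
You can also check the rules themselves programmatically. The sketch below (not an official tool) uses Python's standard library to fetch your live robots.txt and report whether each major AI crawler may fetch a given page. The site and test path are placeholders, and the stdlib parser is a simplified approximation of each crawler's own matching logic:

from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.com"
TEST_URL = SITE + "/blog/"

AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "Google-Extended", "GoogleOther", "Bytespider", "CCBot",
    "FacebookBot", "Applebot-Extended", "cohere-ai",
]

parser = RobotFileParser(SITE + "/robots.txt")
parser.read()  # downloads and parses the live file

for crawler in AI_CRAWLERS:
    status = "allowed" if parser.can_fetch(crawler, TEST_URL) else "blocked"
    print(f"{crawler:<20} {status}")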


Sample Robots.txt Configurations

Below are three complete, copy-paste-ready robots.txt templates. Choose the one that matches your business goals.

Configuration 1: Permissive (Allow All AI Crawlers)

Best for businesses that want maximum AI visibility -- appearing in ChatGPT, Claude, Perplexity, Google AI Overviews, and all other AI search products.

# ============================================
# Robots.txt - Permissive AI Configuration
# Maximum AI visibility and citation potential
# ============================================

# Standard search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# OpenAI crawlers
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic
User-agent: ClaudeBot
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Google AI
User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

# Apple Intelligence
User-agent: Applebot-Extended
Allow: /

# Meta AI
User-agent: FacebookBot
Allow: /

# Other AI crawlers
User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: cohere-ai
Allow: /

# Default rule for all other crawlers
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /account/

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml

Configuration 2: Selective (Allow Citation Crawlers, Block Training-Only)

Best for businesses that want AI citations without contributing to training datasets. This allows crawlers that power real-time search and citation features while blocking crawlers used primarily for model training.

# ============================================
# Robots.txt - Selective AI Configuration
# Allow citations, block training-only crawlers
# ============================================

# Standard search engines
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# OpenAI - allow search/citation crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# OpenAI - block training crawler
User-agent: GPTBot
Disallow: /

# Anthropic - allow for citations
User-agent: ClaudeBot
Allow: /

# Perplexity - allow for citations
User-agent: PerplexityBot
Allow: /

# Google AI training - block
User-agent: Google-Extended
Disallow: /

User-agent: GoogleOther
Disallow: /

# Apple Intelligence - allow
User-agent: Applebot-Extended
Allow: /

# Meta AI - allow
User-agent: FacebookBot
Allow: /

# Training-only crawlers - block
User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

# Default
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /account/

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml

Important caveat: Blocking GPTBot while allowing ChatGPT-User and OAI-SearchBot creates a partial configuration. OpenAI may still use GPTBot for search indexing, so blocking it could reduce your SearchGPT visibility even if OAI-SearchBot is allowed. Monitor your AI citation metrics after making this change.

Configuration 3: Restrictive (Block All AI Crawlers)

For businesses that want to prevent all AI crawling -- typically used by publishers concerned about content licensing, or sites with proprietary content.

# ============================================
# Robots.txt - Restrictive AI Configuration
# Block all AI crawlers
# ============================================

# Standard search engines - still allowed
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# Block all AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GoogleOther
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /

# Default
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /account/

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml

Be aware: This configuration makes your content invisible to AI search platforms. You will not receive citations from ChatGPT, Claude, Perplexity, or any AI-powered answer engine. Traditional Google and Bing search results are unaffected.


Robots.txt vs ai.txt vs llms.txt -- Which Do You Need?

Three files now govern how AI systems interact with your website. They serve different purposes and are complementary, not interchangeable.

| Feature | robots.txt | llms.txt | ai.txt |
| --- | --- | --- | --- |
| Purpose | Controls crawler access | Provides AI-readable site context | Declares AI usage policies |
| What it does | Tells crawlers which pages they can or cannot visit | Tells AI systems what your site is about, key pages, and content structure | Tells AI systems how they may use your content (licensing, attribution) |
| Standard | Established (1994) | Emerging (llms-txt.site) | Proposed (not yet standardized) |
| Location | /robots.txt | /llms.txt | /ai.txt |
| Supported by | All major crawlers | Growing adoption by AI platforms | Limited adoption |
| Required for AI visibility? | Yes -- must allow crawlers | Recommended -- improves citation quality | Optional |

How They Work Together

Think of it as a three-layer system:

  1. robots.txt is the gatekeeper. It determines which AI crawlers can access your site at all.
  2. llms.txt is the tour guide. Once a crawler is allowed in, llms.txt tells it what your site is about, which pages matter most, and how content is organized.
  3. ai.txt is the policy document. It communicates your terms for how AI systems may use the content they find.

If you only implement one, make it robots.txt with AI crawler permissions. If you implement two, add llms.txt. For the full stack, add ai.txt as well.
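
For reference, here is a minimal llms.txt sketch in the style of the emerging proposal: a Markdown file with a site name, a one-line summary, and lists of key links. The company name, descriptions, and URLs below are placeholders, not a prescribed structure:

# Your Company

> One-sentence summary of what your company does and who it serves.

## Key pages

- [Blog](https://yoursite.com/blog/): guides and analysis
- [Press releases](https://yoursite.com/press-releases/): company announcements
- [Product](https://yoursite.com/product/): what you offer and how it works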

For a detailed guide on llms.txt, see our llms.txt pillar page. For the complete AI discoverability framework, see the AI Discovery Protocol documentation.


How AI Crawlers Differ from Search Engine Crawlers

AI crawlers and traditional search engine crawlers (Googlebot, Bingbot) behave differently in several important ways.

Crawl Frequency and Depth

Traditional search engine crawlers maintain persistent indexes and re-crawl pages on regular schedules based on crawl budgets. AI crawlers tend to be more aggressive in initial crawling but less predictable in re-crawl timing. Some AI crawlers (like ChatGPT-User) fetch pages on demand in real time rather than maintaining a pre-built index.

Content Extraction

Googlebot primarily processes HTML structure, metadata, and links for ranking purposes. AI crawlers extract full text content for semantic understanding. They parse your content the way a human reader would -- paragraph by paragraph -- looking for facts, claims, statistics, and quotable statements that can be cited in AI-generated answers.

Respect for Robots.txt

All major AI crawlers respect robots.txt directives. This is a notable improvement from the early days of AI crawling (2023-2024), when compliance was inconsistent. OpenAI, Anthropic, and Perplexity have all publicly committed to honoring robots.txt.

Impact of Blocking

Blocking Googlebot removes you from Google Search entirely. Blocking an AI crawler removes you from that specific AI platform's knowledge base and citation pool. The difference is scope: blocking Googlebot affects billions of daily searches; blocking ClaudeBot affects Claude users specifically.

Crawl Rate

Some AI crawlers, particularly Bytespider, are known for aggressive crawl rates that can strain server resources. If you notice performance issues, you can use Crawl-delay directives (supported by some crawlers) or rate-limiting at the server level.
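
If you take the robots.txt route, the directive looks like the snippet below. The 10-second value is illustrative, and Crawl-delay is a non-standard extension: some crawlers honor it, others (including Googlebot) ignore it, so treat it as a hint rather than a guarantee.

User-agent: Bytespider
Crawl-delay: 10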


Common Mistakes

1. Accidentally Blocking AI Crawlers with Wildcard Rules

Many legacy robots.txt files include a broad wildcard block:

User-agent: *
Disallow: /

This blocks every crawler that does not have its own User-agent group, including AI crawlers. If your robots.txt has a wildcard disallow, you must add an explicit User-agent group for each AI crawler you want to permit. Placing those groups before the wildcard rule is the convention, though compliant crawlers match the most specific User-agent group wherever it appears in the file.

Fix:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /

2. Blocking GPTBot but Expecting ChatGPT Citations

GPTBot is the foundational crawler for OpenAI's entire ecosystem. While ChatGPT-User and OAI-SearchBot handle real-time browsing and search respectively, GPTBot builds the underlying index that informs ChatGPT's knowledge. Blocking GPTBot significantly reduces your chances of being cited, even if you allow the other OpenAI crawlers.

3. Not Having a Robots.txt at All

If your site has no robots.txt file, AI crawlers will crawl everything by default. While this is technically permissive, it also means you have no control over which pages are crawled. Private pages, admin panels, staging content, and duplicate pages may all be indexed. Always have a robots.txt file, even if your policy is permissive.

4. Using Robots Meta Tags Instead of Robots.txt

HTML meta tags like <meta name="robots" content="noindex"> affect search engines but are not consistently respected by all AI crawlers. For reliable AI crawler control, always use robots.txt as the primary mechanism and treat meta tags as a secondary layer.

5. Forgetting to Include a Sitemap Reference

Your robots.txt should include a Sitemap: directive pointing to your XML sitemap. AI crawlers use sitemaps to discover content efficiently. Without it, crawlers must discover pages by following links, which is slower and may miss orphaned pages.

Sitemap: https://yoursite.com/sitemap.xml

Frequently Asked Questions

How do I allow AI crawlers in robots.txt?

Add a User-agent directive for each AI crawler you want to allow, followed by Allow: /. Each crawler needs its own block. For example:

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

Place these directives in your robots.txt file at the root of your domain. There is no registration or approval process -- once the directives are in place, crawlers will detect them on their next visit.

Should I block AI crawlers?

It depends entirely on your business goals.

Allow AI crawlers if you want your content cited by ChatGPT, Claude, Perplexity, and other AI platforms. AI citations drive qualified traffic and brand authority in the Citation Economy. For most businesses -- especially those publishing press releases, thought leadership, or product content -- allowing AI crawlers is strongly recommended.

Block AI crawlers if you are a publisher concerned about content being used for AI model training without compensation, or if you have proprietary content that should not be indexed outside traditional search engines. Even in this case, consider a selective approach: block training-only crawlers while allowing citation crawlers.

What AI crawlers exist in 2026?

The twelve major AI crawlers active in 2026 are: GPTBot, ChatGPT-User, and OAI-SearchBot (OpenAI); ClaudeBot (Anthropic); PerplexityBot (Perplexity AI); Google-Extended and GoogleOther (Google); Bytespider (ByteDance); CCBot (Common Crawl); FacebookBot (Meta); Applebot-Extended (Apple); and cohere-ai (Cohere). See the full comparison table above for details on each crawler's purpose and citation impact.

What is the difference between robots.txt, ai.txt, and llms.txt?

Robots.txt is the access control layer. It tells crawlers which pages they are permitted to visit. It has been the web standard for crawler management since 1994 and is supported by all major AI crawlers.

Llms.txt is the context layer. It provides AI systems with a structured summary of your site: what you do, which pages are most important, and how your content is organized. It helps AI systems understand your site holistically rather than page by page.

Ai.txt is the policy layer. It declares how AI systems may use your content -- whether it can be used for training, whether attribution is required, and any licensing terms. It is a proposed standard that has not yet achieved wide adoption.

All three are complementary. Robots.txt is essential. Llms.txt is strongly recommended. Ai.txt is optional but forward-looking.

Does blocking GPTBot prevent ChatGPT from citing my content?

Yes. GPTBot is OpenAI's primary web crawler, and blocking it in robots.txt prevents OpenAI from indexing your content. This means ChatGPT cannot cite your pages in its responses, and your content will not appear in SearchGPT results.

If you block GPTBot but allow ChatGPT-User, ChatGPT may still be able to access your pages during live browsing sessions (when a user explicitly asks it to visit a URL). However, your content will not appear in ChatGPT's proactive search results or be part of its indexed knowledge base. For most practical purposes, blocking GPTBot means invisibility to ChatGPT.


Conclusion

Your robots.txt file is no longer just about managing Googlebot. In 2026, it is the primary control mechanism for AI crawler access and directly determines whether your content appears in AI-generated citations across ChatGPT, Claude, Perplexity, and other platforms.

For most businesses, the recommended approach is:

  1. Allow GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, and PerplexityBot for maximum citation visibility.
  2. Make a deliberate decision about training-only crawlers (Google-Extended, CCBot, Bytespider) based on your content licensing stance.
  3. Add a Sitemap directive to help crawlers discover your content efficiently.
  4. Complement robots.txt with an llms.txt file for richer AI context.
  5. Test your configuration regularly using the AI Visibility Checker.

The businesses winning in the Citation Economy are the ones making deliberate, informed decisions about AI crawler access. Your robots.txt is where that starts.



About Pressonify Team

The Pressonify Team builds AI-first press release infrastructure for the Citation Economy.