Why Audit for AI Crawlers?
Traditional SEO audits check for search engine crawlers (Googlebot, Bingbot). But in 2026, AI crawlers are equally important:
- GPTBot (OpenAI) - Feeds ChatGPT and SearchGPT
- ClaudeBot (Anthropic) - Powers Claude AI assistant
- PerplexityBot - Fuels Perplexity answer engine
- GoogleOther - Google's AI training crawler
- Applebot-Extended - Apple Intelligence features
- anthropic-ai - Anthropic's research crawler
- Amazonbot - Amazon Alexa and AI services
- Bingbot - Microsoft Bing and Copilot (note: Microsoft's published crawler is Bingbot; there is no separate "Bingbot-AI" agent)
- Bytespider - TikTok/ByteDance AI
- Meta-ExternalAgent - Meta AI (Facebook, Instagram)
- Diffbot - Knowledge graph extraction
Blocking even one of these crawlers can eliminate your presence from major AI platforms. An AI Crawler Audit ensures you're maximizing visibility in the Citation Economy.
robots.txt Configuration
Step 1 of any AI crawler audit: check your robots.txt file (yoursite.com/robots.txt). You should explicitly allow major AI crawlers:
# robots.txt - AI Crawler Configuration
# Allow major AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: GoogleOther
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: Bytespider
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
# Optional: Protect sensitive areas
User-agent: *
Disallow: /admin/
Disallow: /private/
Critical mistake: many sites have User-agent: * with Disallow: /, which blocks ALL crawlers, AI included. If you see this, you're invisible to AI systems. Fix it immediately. Learn more in our LLMO guide.
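You can automate this check with Python's standard-library robots.txt parser. A minimal sketch (the crawler list and site URL are placeholders; adjust to your setup):

```python
# Quick audit: would key AI crawlers be allowed to fetch a given URL?
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider"]

def audit_robots(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return {crawler: allowed?} for each AI user agent."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in AI_CRAWLERS}

# The "critical mistake" above: a robots.txt that blocks everything.
blocked = audit_robots("User-agent: *\nDisallow: /")
print(blocked)  # every crawler maps to False
```

Run this against your live robots.txt (fetch it first with any HTTP client) and any False entry is a crawler you're locking out.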
llms.txt Verification
Step 2: Verify you have a properly formatted llms.txt file (yoursite.com/llms.txt). This is the 'AI context' file that provides AI crawlers with structured information about your site.
Checklist:
- ✅ File exists at domain root (/llms.txt)
- ✅ File is publicly accessible (200 status code, no authentication)
- ✅ File size is under 2KB (use llms-full.txt for extended content)
- ✅ Includes YAML frontmatter with version and lastModified
- ✅ Contains site description and expertise areas
- ✅ Lists 10-20 key pages with descriptions
- ✅ Includes contact information
- ✅ Updated within the last 90 days
Use our free llms.txt generator to create a compliant file in 60 seconds. View our live llms.txt example for reference. Full implementation details in our llms.txt guide.
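Several of the checklist items above can be verified mechanically. A sketch of a validator (the frontmatter field names version and lastModified follow this guide's convention, not a formal spec):

```python
# Check an llms.txt body against the size, frontmatter, and freshness rules.
from datetime import datetime, timedelta

def audit_llms_txt(body: str) -> dict:
    issues = []
    if len(body.encode("utf-8")) > 2048:
        issues.append("over 2KB -- move extended content to llms-full.txt")
    if not body.startswith("---"):
        issues.append("missing YAML frontmatter")
    if "lastModified:" in body:
        date_str = body.split("lastModified:")[1].split()[0]
        modified = datetime.strptime(date_str, "%Y-%m-%d")
        if datetime.now() - modified > timedelta(days=90):
            issues.append("lastModified older than 90 days")
    else:
        issues.append("no lastModified field")
    return {"ok": not issues, "issues": issues}
```

Accessibility (the 200-status check) still needs a live HTTP request; this only validates the file's content.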
AI Sitemap Check
Step 3: Verify your sitemap is AI-friendly. While traditional sitemaps list URLs, AI-specific sitemaps should include:
- Semantic metadata: What each page is about, not just the URL
- Update frequency: How often AI should re-crawl
- Priority signals: Which pages are most authoritative
- Content types: Article, Product, FAQ, HowTo, etc.
Example AI-optimized sitemap entry (note: the xhtml:meta elements here are illustrative, not part of the standard sitemap protocol, which only defines loc, lastmod, changefreq, and priority; parsers generally ignore elements they don't recognize):
<url>
  <loc>https://pressonify.ai/learn/geo</loc>
  <lastmod>2026-01-03</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.9</priority>
  <xhtml:meta name="description" content="GEO guide for AI citation" />
  <xhtml:meta name="content-type" content="educational-guide" />
</url>
Include your sitemap in robots.txt:
Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/sitemap-ai.xml
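To audit an existing sitemap for the freshness signals above, you can parse out each loc/lastmod pair and flag entries that give crawlers no re-crawl signal. A minimal sketch using the standard library:

```python
# Parse <loc>/<lastmod> pairs from a sitemap; entries without lastmod
# give AI crawlers no freshness signal and are worth flagging.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_entries(xml_text: str) -> list:
    root = ET.fromstring(xml_text)
    return [
        {
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
        }
        for url in root.findall("sm:url", NS)
    ]
```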
AI Discovery Protocol (ADP) Endpoints
Step 4: Audit for full AI Discovery Protocol (ADP) compliance. Check for these 11 endpoints:
- ✅ /.well-known/ai.json - Main ADP manifest
- ✅ /robots.txt - AI crawler permissions
- ✅ /llms.txt - Compact site context
- ✅ /llms-full.txt - Extended content (optional)
- ✅ /sitemap.xml - Standard sitemap
- ✅ /sitemap-ai.xml - AI-specific sitemap (optional)
- ✅ /feed.json - JSON Feed v1.1
- ✅ /rss.xml - Traditional RSS feed
- ✅ /updates.json - Delta feed for incremental crawling
- ✅ /knowledge-graph.json - Schema.org entity catalog
- ✅ /.well-known/security.txt - Security contact
Compliance levels:
- Basic (40%): robots.txt + llms.txt
- Standard (70%): + ai.json + sitemap
- Complete (100%): All 11 endpoints
Use our Agentic Audit tool to scan all endpoints automatically.
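The tiering above can be expressed as a small scoring function. A sketch (the tier thresholds mirror this guide's Basic/Standard/Complete definitions; the Agentic Audit tool's actual scoring may differ):

```python
# Map a set of discovered endpoints to an ADP compliance tier.
ADP_ENDPOINTS = [
    "/.well-known/ai.json", "/robots.txt", "/llms.txt", "/llms-full.txt",
    "/sitemap.xml", "/sitemap-ai.xml", "/feed.json", "/rss.xml",
    "/updates.json", "/knowledge-graph.json", "/.well-known/security.txt",
]

def compliance_level(present: set) -> str:
    if all(ep in present for ep in ADP_ENDPOINTS):
        return "Complete (100%)"
    if {"/robots.txt", "/llms.txt", "/.well-known/ai.json", "/sitemap.xml"} <= present:
        return "Standard (70%)"
    if {"/robots.txt", "/llms.txt"} <= present:
        return "Basic (40%)"
    return "Non-compliant"
```

Feed it the set of endpoints that returned a 200 status (e.g. from HEAD requests against each path) to get your tier.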
HTTP Header Verification
Step 5: Check that your AI-related endpoints include proper HTTP headers:
- ETag: Cache validation for efficient re-crawling
- Content-Digest: SHA-256 integrity verification (RFC 9530)
- X-Update-Frequency: Signals to AI crawlers (hourly/daily/weekly)
- X-LLM-Optimized: Indicates AI-optimized content
- Access-Control-Allow-Origin: CORS for AI tools (* for public content)
- Cache-Control: Appropriate caching directives
Test headers using:
curl -I https://yoursite.com/llms.txt
Look for:
HTTP/2 200
ETag: W/"abc123"
Content-Digest: sha-256=:xyz789=:
X-Update-Frequency: weekly
Access-Control-Allow-Origin: *
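Beyond eyeballing curl output, a header audit is easy to script. A sketch (header names are case-insensitive per HTTP, so normalize before comparing; the recommended list is this guide's, not a standard):

```python
# Report which recommended AI-crawler headers a response is missing.
RECOMMENDED = ["ETag", "Content-Digest", "X-Update-Frequency",
               "Access-Control-Allow-Origin", "Cache-Control"]

def missing_headers(headers: dict) -> list:
    have = {k.lower() for k in headers}
    return [h for h in RECOMMENDED if h.lower() not in have]
```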
Pressonify's press releases include all recommended headers automatically. Learn more about technical implementation in our Schema.org for AI guide.
Action Plan: Fixing Common Issues
Based on your audit, prioritize fixes:
🔴 Critical (fix immediately):
- robots.txt blocking AI crawlers with Disallow: /
- Missing llms.txt file (your site is invisible to LLMs)
- 404 errors on referenced endpoints in ai.json
🟡 High Priority (fix this week):
- Outdated llms.txt (lastModified > 6 months old)
- Missing /.well-known/ai.json manifest
- No Schema.org markup on key pages
🟢 Medium Priority (fix this month):
- Missing AI-specific sitemap
- Missing HTTP headers (ETag, Content-Digest)
- No JSON Feed or updates.json delta feed
Start with the Critical fixes to get AI crawlers accessing your site, then work down through the High and Medium Priority items. Track progress with our AI Visibility Checker.