ADP 2.1+: Proving AI Crawlers Actually Use Your Infrastructure

We built proof infrastructure that tracks AI crawler hits on ADP endpoints and monitors which press releases get cited. Here's the technical implementation.

The Problem: Everyone claims to be "AI-optimized." Few can prove it.

We built 11 ADP endpoints for AI crawlers. But a natural question emerged: Are AI systems actually using them?

Building infrastructure is easy. Proving value is hard.

Today we're releasing proof infrastructure that answers three critical questions:

  1. Are AI crawlers hitting our ADP endpoints?
  2. Which crawlers, which endpoints, how often?
  3. Are our press releases actually getting cited?

The Gap Between Theory and Proof

When we launched ADP 2.1, we had beautiful infrastructure:

  • /llms.txt - Compact site overview (1,247 tokens)
  • /ai-discovery.json - Meta-index of all endpoints
  • /knowledge-graph.json - Entity relationships
  • /.well-known/ai.json - Discovery manifest

But we couldn't answer a simple question: "Is GPTBot actually crawling these?"

That's a credibility problem. Claiming AI optimization without proving AI engagement is just marketing.


What We Built: Proof Infrastructure

1. AI Crawler Hit Tracking

Every time an AI crawler hits an ADP endpoint, we log it:

Endpoint: /llms.txt
Crawler: GPTBot
User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.0; +https://openai.com/gptbot
IP: [anonymized]
Timestamp: 2025-12-27T19:00:00Z

Key design decisions:

  • Fire-and-forget logging: Uses asyncio.create_task() so logging never blocks the response
  • Zero performance impact: The endpoint returns immediately; logging happens in the background
  • Identification of 11 AI crawlers: GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Bingbot, Cohere-AI, YouBot, Applebot-Extended, Meta-ExternalAgent, CCBot, Diffbot
  • Fallback categories: Unknown bots → "other_bot", browsers → "human"

2. Public Transparency API

We believe proof should be public. Our /api/v1/adp/stats endpoint exposes aggregated crawler data:

# Get overall stats
curl https://pressonify.ai/api/v1/adp/stats?days=30

# Response
{
  "period_days": 30,
  "total_hits": 2847,
  "ai_crawler_hits": 2156,
  "human_hits": 691,
  "crawler_breakdown": {
    "GPTBot": 892,
    "ClaudeBot": 567,
    "PerplexityBot": 423,
    "Google-Extended": 274
  },
  "endpoint_breakdown": {
    "/llms.txt": 1247,
    "/ai-discovery.json": 856,
    "/knowledge-graph.json": 432
  }
}

Five public endpoints:
- GET /api/v1/adp/stats - Overall statistics
- GET /api/v1/adp/stats/crawlers - Top crawlers ranked
- GET /api/v1/adp/stats/endpoints - Top endpoints ranked
- GET /api/v1/adp/stats/daily - Daily trends
- GET /api/v1/adp/stats/endpoint/{endpoint} - Specific endpoint deep-dive
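As a quick illustration, here's how a client might reduce a stats payload like the example above to headline numbers. This is a sketch: the field names follow the sample response, the `summarize_stats` helper is ours, and fetching over HTTP (e.g. with `requests`) is omitted.

```python
def summarize_stats(stats: dict) -> dict:
    """Compute AI-crawler share and the top crawler from a stats payload."""
    total = stats["total_hits"]
    top_crawler, top_hits = max(
        stats["crawler_breakdown"].items(), key=lambda kv: kv[1]
    )
    return {
        "ai_share_pct": round(100 * stats["ai_crawler_hits"] / total, 1) if total else 0.0,
        "top_crawler": top_crawler,
        "top_crawler_hits": top_hits,
    }

# Sample payload shaped like the response shown above
example = {
    "period_days": 30,
    "total_hits": 2847,
    "ai_crawler_hits": 2156,
    "human_hits": 691,
    "crawler_breakdown": {"GPTBot": 892, "ClaudeBot": 567,
                          "PerplexityBot": 423, "Google-Extended": 274},
}
summary = summarize_stats(example)
```

With the numbers above, roughly three quarters of all hits come from AI crawlers, led by GPTBot.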

3. PR Citation Tracker

The ultimate proof: Are press releases actually being cited by AI?

Our Citation Tracker uses the Perplexity API to search for your content:

query_templates = [
    "{company_name} press release",
    "{company_name} latest news",
    "{company_name} announcement",
    "{company_name} {industry} news",
    "{company_name} launch",
    "news about {company_name}"
]
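Filling those templates for a given PR is straightforward. A minimal sketch (the `build_queries` helper and its parameter names are illustrative, not the production code):

```python
# Templates as listed above
query_templates = [
    "{company_name} press release",
    "{company_name} latest news",
    "{company_name} announcement",
    "{company_name} {industry} news",
    "{company_name} launch",
    "news about {company_name}",
]

def build_queries(company_name: str, industry: str) -> list[str]:
    """Expand every template with the PR's company name and industry."""
    return [t.format(company_name=company_name, industry=industry)
            for t in query_templates]

queries = build_queries("Acme Robotics", "robotics")
```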

For each query, we check if the AI response cites back to the press release URL. When it does, we log:

  • Platform: Perplexity, ChatGPT, Claude, Gemini, SearchGPT
  • Query: What search triggered the citation
  • Position: Where in the response the citation appeared
  • Context: The text around the citation
  • Verification: Manual or automatic confirmation

Rate limiting: 5 scans per hour per IP (prevents abuse while enabling legitimate monitoring)

Query deduplication: SHA-256 hashing ensures the same query isn't scanned repeatedly within 24 hours


Technical Implementation

Crawler Identification

We maintain a mapping of AI crawler user agent patterns:

AI_CRAWLERS = {
    "GPTBot": "gptbot",
    "ClaudeBot": "claudebot",
    "PerplexityBot": "perplexitybot",
    "Google-Extended": "google-extended",
    "Bingbot": "bingbot",
    "Cohere-AI": "cohere-ai",
    "YouBot": "youbot",
    "Applebot-Extended": "applebot-extended",
    "Meta-ExternalAgent": "meta-externalagent",
    "CCBot": "ccbot",
    "Diffbot": "diffbot"
}

The identification function checks user agent strings case-insensitively:

def identify_crawler(user_agent: str) -> str:
    if not user_agent:
        return "unknown"

    ua_lower = user_agent.lower()

    for crawler_name, pattern in AI_CRAWLERS.items():
        if pattern in ua_lower:
            return crawler_name

    # Check for generic bot patterns
    if any(bot in ua_lower for bot in ["bot", "crawler", "spider"]):
        return "other_bot"

    return "human"

Fire-and-Forget Logging

Performance is critical. We can't slow down ADP endpoint responses to log hits. The solution: fire-and-forget async tasks.

from fastapi import Request
import asyncio

async def log_adp_hit(request: Request, endpoint: str):
    """Fire-and-forget logging - never blocks response"""
    asyncio.create_task(_log_adp_hit_async(request, endpoint))

async def _log_adp_hit_async(request: Request, endpoint: str):
    """Background task that actually logs the hit"""
    try:
        user_agent = request.headers.get("user-agent", "")
        crawler_name = identify_crawler(user_agent)
        ip_address = get_client_ip(request)
        referer = request.headers.get("referer")

        await supabase.table("adp_crawler_hits").insert({
            "endpoint": endpoint,
            "crawler_name": crawler_name,
            "user_agent": user_agent[:500],  # Truncate
            "ip_address": ip_address,
            "referer": referer[:1000] if referer else None
        }).execute()
    except Exception as e:
        logger.warning(f"Failed to log ADP hit: {e}")
        # Never raise - this is background work

Database Schema

CREATE TABLE adp_crawler_hits (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    endpoint VARCHAR(100) NOT NULL,
    crawler_name VARCHAR(100),
    user_agent TEXT,
    ip_address VARCHAR(45),
    referer TEXT,
    hit_at TIMESTAMPTZ DEFAULT NOW()
);

-- Indexes for efficient querying
CREATE INDEX idx_adp_hits_endpoint ON adp_crawler_hits(endpoint);
CREATE INDEX idx_adp_hits_crawler ON adp_crawler_hits(crawler_name);
CREATE INDEX idx_adp_hits_time ON adp_crawler_hits(hit_at DESC);
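In production the grouping behind /api/v1/adp/stats would be a SQL aggregate over this table; as an illustration, here is the same roll-up in Python over fetched rows (the `aggregate_hits` helper and its row shape are assumptions based on the schema above):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Fallback labels from identify_crawler() that don't count as AI crawlers
NON_AI = {"human", "other_bot", "unknown"}

def aggregate_hits(rows: list[dict], days: int = 30) -> dict:
    """Roll adp_crawler_hits rows up into a stats-style payload."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    recent = [r for r in rows if r["hit_at"] >= cutoff]
    by_crawler = Counter(r["crawler_name"] for r in recent)
    return {
        "period_days": days,
        "total_hits": len(recent),
        "ai_crawler_hits": sum(n for name, n in by_crawler.items()
                               if name not in NON_AI),
        "human_hits": by_crawler.get("human", 0),
        "crawler_breakdown": dict(by_crawler.most_common()),
        "endpoint_breakdown": dict(
            Counter(r["endpoint"] for r in recent).most_common()),
    }
```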

Enhanced PR Analytics

Beyond crawler tracking, we enhanced the PR analytics system:

Traffic Source Categorization

Every view is now categorized:

  • Direct: No referrer or direct navigation
  • Search: Google, Bing, Yahoo, DuckDuckGo, Baidu
  • Social: Twitter, LinkedIn, Facebook, Reddit, Instagram
  • Referral: Other websites linking to your PR

Device Breakdown

User agent parsing identifies:
- Mobile devices (iOS, Android)
- Desktop browsers (Chrome, Firefox, Safari)
- Tablets (iPad, Android tablets)
- Other (bots, crawlers, unknown)

Top Referrers

We extract and rank referring domains:

{
  "top_referrers": [
    {"domain": "google.com", "count": 456},
    {"domain": "twitter.com", "count": 234},
    {"domain": "linkedin.com", "count": 189}
  ]
}

Live Endpoints

All of this is live and testable right now:

ADP Stats

# Overall statistics
curl https://pressonify.ai/api/v1/adp/stats?days=30

# Top AI crawlers
curl https://pressonify.ai/api/v1/adp/stats/crawlers?days=7

# Top endpoints
curl https://pressonify.ai/api/v1/adp/stats/endpoints?days=7

# Daily trends
curl https://pressonify.ai/api/v1/adp/stats/daily?days=14

# Specific endpoint
curl https://pressonify.ai/api/v1/adp/stats/endpoint/llms.txt

PR Citations (requires authentication)

# Platform-wide stats
curl https://pressonify.ai/api/v1/citations/stats

# Citations for specific PR
curl https://pressonify.ai/api/v1/citations/pr/{pr_id}

# Trigger manual scan
curl -X POST https://pressonify.ai/api/v1/citations/scan/{pr_id}

Enhanced Analytics

# Detailed PR analytics
curl https://pressonify.ai/api/v1/analytics/pr/{pr_id}/detailed?days=30

Why This Matters

For credibility: "We're AI-optimized" is a claim. "GPTBot hit our endpoints 892 times this month" is proof.

For iteration: If ClaudeBot isn't crawling /knowledge-graph.json, we know to investigate why.

For transparency: Users can see which AI systems engage with their press releases.

For the ecosystem: Public transparency APIs help the entire industry understand AI crawler behavior.


What's Next

This is ADP 2.1+. The "+" represents our commitment to continuous improvement:

  1. Real-time dashboards: Visualize crawler activity live
  2. Alerting: Get notified when new AI crawlers appear
  3. Citation scoring: Quantify citation quality, not just count
  4. Scheduled scans: Automatic periodic citation checks
  5. Competitive benchmarking: Compare your citation rate to industry averages

The Bottom Line

Building AI infrastructure is necessary. Proving it works is what creates trust.

With ADP 2.1+, every Pressonify user can:
- See which AI crawlers engage with their content
- Track which press releases get cited
- Access detailed analytics on traffic sources and devices
- Verify that "AI optimization" isn't just marketing

The endpoints are live. The data is public. The proof is transparent.

That's the difference between claiming AI optimization and demonstrating it.


Test the live endpoints yourself:
- ADP Stats API
- AI Discovery Endpoints
- LLMs.txt

Published December 27, 2025 | Pressonify.ai