Dynamic llms.txt: Technical Implementation & Spec Proposal
TL;DR
This is the technical companion to Part 1. Here you'll find:
- Complete FastAPI code for dynamic llms.txt
- ADP header generation patterns
- Database query optimization strategies
- A proposed spec extension for scope-specific llms.txt files
All code is production-tested on Pressonify.ai.
Architecture Overview
Before diving into code, here's the high-level flow:
```
┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│   AI Crawler     │─────▶│     FastAPI      │─────▶│     Supabase     │
│  (Perplexity,    │      │     Endpoint     │      │     Database     │
│   ChatGPT)       │      │                  │      │                  │
└──────────────────┘      └──────────────────┘      └──────────────────┘
         │                         │                         │
         ▼                         ▼                         ▼
  Request with              Query published            Return press
  If-None-Match             press releases             releases as
  header (ETag)             (limit, fields)            list of dicts
         │                         │                         │
         └─────────────────────────┴─────────────────────────┘
                                   │
                                   ▼
                      ┌───────────────────────┐
                      │   Generate Response   │
                      │  - YAML frontmatter   │
                      │  - Markdown content   │
                      │  - ADP headers        │
                      └───────────────────────┘
```
Why Computed Over Cached?
We chose to compute content at request time rather than cache it because:
- Freshness > Latency: A few extra milliseconds is worth always-accurate content
- Simplicity: No cache invalidation logic to maintain
- Database is fast: Supabase queries return in <50ms for our use case
- Headers handle efficiency: ETag support means crawlers skip redundant downloads
The FastAPI Implementation
Here's the core endpoint structure:
```python
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
from datetime import datetime
import hashlib
import base64

app = FastAPI()


@app.get("/news/llms.txt", response_class=PlainTextResponse)
async def news_llms_txt():
    """
    Dynamic llms.txt endpoint for news content.

    Generates at request time from database with:
    - Computed YAML frontmatter
    - Latest press releases
    - ADP-compliant HTTP headers
    """
    # 1. Fetch content from database
    prs = await get_published_prs(limit=50)

    # 2. Generate llms.txt content
    content = generate_llms_content(prs)

    # 3. Generate ADP headers
    headers = generate_adp_headers(content, frequency="realtime")

    return PlainTextResponse(content, headers=headers)
```
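The handler above always sends a full body. To make the ETag pay off, conditional requests need to be answered with `304 Not Modified` when the validator matches. The variant below is a minimal sketch of that logic; our production handler may differ in the details:

```python
from fastapi import Request
from fastapi.responses import Response


@app.get("/news/llms.txt", response_class=PlainTextResponse)
async def news_llms_txt(request: Request):
    prs = await get_published_prs(limit=50)
    content = generate_llms_content(prs)
    headers = generate_adp_headers(content, frequency="realtime")

    # If the crawler's cached ETag still matches, send 304 and skip the body.
    if request.headers.get("If-None-Match") == headers["ETag"]:
        return Response(status_code=304, headers=headers)

    return PlainTextResponse(content, headers=headers)
```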
The Content Generator
```python
def generate_llms_content(prs: list) -> str:
    """
    Generate llms.txt content from press release data.
    """
    now = datetime.utcnow().isoformat() + "Z"

    # YAML frontmatter with computed fields
    content = f"""---
version: 2.9.5
lastModified: {now}
totalArticles: {len(prs)}
scope: news-content-only
updateFrequency: realtime
protocol: AI Discovery Protocol v2.1
---

# Pressonify.ai News Feed

> {len(prs)} press releases optimized for AI citation

## Latest Press Releases

"""

    # Add each press release
    for pr in prs:
        content += f"""### {pr['headline']}

- **Company**: {pr['company_name']}
- **Category**: {pr['category']}
- **Published**: {pr['published_at']}
- **URL**: https://pressonify.ai/news/{pr['slug']}-{pr['id']}
- **Summary**: {pr['summary'][:200]}...

"""

    # Add available feeds section
    content += """## Available Feeds

For real-time updates, subscribe to our feeds:

- **RSS**: https://pressonify.ai/rss
- **JSON Feed**: https://pressonify.ai/feed.json
- **Delta Updates**: https://pressonify.ai/updates.json
- **Bulk Archive**: https://pressonify.ai/news/archive.jsonl

## About This Endpoint

This `/news/llms.txt` endpoint is dynamically generated from our database
at request time. Unlike static llms.txt files, every request returns the
current state of our news feed.

See also: https://pressonify.ai/llms.txt (full site context)
"""

    return content
```
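To sanity-check the generator locally, you can feed it a single record. The values below are made-up sample data, used only for illustration:

```python
sample_prs = [{
    "id": 123,  # hypothetical sample record
    "slug": "acme-launches-widget",
    "headline": "Acme Launches Widget",
    "summary": "Acme today announced the Widget, a tool for doing things faster.",
    "company_name": "Acme Inc.",
    "category": "Product Launch",
    "published_at": "2026-01-04T09:00:00Z",
}]

# Prints the YAML frontmatter followed by a single "Latest Press Releases" entry.
print(generate_llms_content(sample_prs))
```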
ADP Header Generation
Headers are critical for efficient crawling. Here's our generator:
```python
import hashlib
import base64
from typing import Literal


def generate_adp_headers(
    content: str,
    frequency: Literal["realtime", "hourly", "daily", "weekly"] = "daily"
) -> dict:
    """
    Generate AI Discovery Protocol compliant HTTP headers.

    Args:
        content: The response body content
        frequency: Update frequency hint for crawlers

    Returns:
        Dict of HTTP headers
    """
    # Content-based hashes
    content_bytes = content.encode('utf-8')
    sha256_hash = hashlib.sha256(content_bytes).digest()
    md5_hash = hashlib.md5(content_bytes).hexdigest()

    # Cache durations by frequency
    cache_durations = {
        "realtime": 300,     # 5 minutes
        "hourly": 3600,      # 1 hour
        "daily": 86400,      # 24 hours
        "weekly": 604800     # 7 days
    }

    return {
        # Cache validation
        "ETag": f'W/"{md5_hash}"',

        # Content integrity (RFC 9530)
        "Content-Digest": f"sha-256=:{base64.b64encode(sha256_hash).decode()}:",

        # Crawler scheduling hint
        "X-Update-Frequency": frequency,

        # Browser/CDN caching
        "Cache-Control": f"public, max-age={cache_durations[frequency]}",

        # CORS for browser-based AI tools
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Expose-Headers": "ETag, Content-Digest, X-Update-Frequency",

        # Content type
        "Content-Type": "text/plain; charset=utf-8"
    }
```
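Calling it on any body shows the shape of the result. Identical content always produces the same ETag and Content-Digest, which is exactly what makes conditional requests work:

```python
headers = generate_adp_headers("sample body", frequency="realtime")

for name in ("ETag", "Content-Digest", "X-Update-Frequency", "Cache-Control"):
    print(f"{name}: {headers[name]}")

# Same input, same hashes: a crawler can safely reuse the ETag it saw last time.
assert headers["ETag"] == generate_adp_headers("sample body", frequency="realtime")["ETag"]
```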
Header Breakdown
| Header | RFC | Purpose |
|---|---|---|
| `ETag` | RFC 7232 | Weak validator for cache freshness. Crawlers send `If-None-Match` to skip unchanged content. |
| `Content-Digest` | RFC 9530 | SHA-256 hash of body for integrity verification. |
| `X-Update-Frequency` | Custom | Hints how often crawlers should return. |
| `Cache-Control` | RFC 7234 | Browser and CDN caching directives. |
| `Access-Control-*` | CORS | Enables browser-based AI tools to fetch content. |
Database Query Optimization
For real-time generation, query efficiency matters:
```python
async def get_published_prs(
    limit: int = 50,
    since: str = None,
    category: str = None
) -> list:
    """
    Fetch published press releases optimized for llms.txt.

    Only fetches fields needed for the llms.txt format.
    """
    supabase = get_supabase_client()

    # Start query with minimal field selection
    query = supabase.table("press_releases").select(
        "id, slug, headline, summary, company_name, category, published_at"
    ).eq(
        "status", "published"
    ).order(
        "published_at", desc=True
    ).limit(limit)

    # Optional filters
    if since:
        query = query.gte("published_at", since)
    if category:
        query = query.eq("category", category)

    result = query.execute()
    return result.data or []
```
Optimization Strategies
- Select only needed fields: Don't fetch `body` if you only need `summary`
- Limit results: 50 items is usually enough for llms.txt context
- Index columns: Ensure `status` and `published_at` are indexed
- Connection pooling: Reuse database connections across requests (see the sketch below)
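The `get_supabase_client()` helper isn't shown in this post. One simple way to get connection reuse is to memoize the client so every request shares it. This is a minimal sketch, assuming the standard `supabase-py` package and `SUPABASE_URL` / `SUPABASE_KEY` environment variables:

```python
import os
from functools import lru_cache

from supabase import Client, create_client


@lru_cache(maxsize=1)
def get_supabase_client() -> Client:
    """Build the Supabase client once and reuse it for every request."""
    return create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
```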
Query Performance
On our production database with ~1,000 press releases:
| Query | Time |
|---|---|
| Full fetch (all fields) | ~120ms |
| Optimized fetch (6 fields) | ~35ms |
| With category filter | ~28ms |
The optimized query is 3-4x faster than the naive full fetch.
YAML Frontmatter Generation
The frontmatter is the "metadata about metadata" that tells AI systems about the document itself:
```python
def generate_yaml_frontmatter(
    prs: list,
    scope: str = "news-content-only"
) -> str:
    """
    Generate YAML frontmatter with computed fields.
    """
    now = datetime.utcnow()

    # Calculate update frequency based on publication rate
    recent_count = sum(
        1 for pr in prs
        if (now - parse_date(pr['published_at'])).days < 1
    )

    if recent_count > 5:
        frequency = "realtime"
    elif recent_count > 0:
        frequency = "hourly"
    else:
        frequency = "daily"

    return f"""---
version: 2.9.5
lastModified: {now.isoformat()}Z
totalArticles: {len(prs)}
scope: {scope}
updateFrequency: {frequency}
protocol: AI Discovery Protocol v2.1
generator: Pressonify.ai Dynamic llms.txt v1.0
---"""
```
Computed Fields Explained
| Field | Type | Purpose |
|---|---|---|
| `version` | Static | API/content version for compatibility |
| `lastModified` | Computed | Exact generation timestamp |
| `totalArticles` | Computed | Database count for context |
| `scope` | Scoped | What content this file covers |
| `updateFrequency` | Computed | Based on recent publication rate |
| `generator` | Static | Identifies the generating system |
Proposed Spec Extension: Scope-Specific llms.txt
Based on our experience, we're proposing an extension to the llms.txt specification.
Problem Statement
Multi-purpose websites have different content types that require different context for AI systems:
- News sites: Articles, breaking news, archives
- E-commerce: Products, categories, reviews
- SaaS platforms: Documentation, blog, changelog
A single /llms.txt file becomes bloated trying to cover everything, or too shallow to be useful for any specific use case.
Proposed Solution: /[scope]/llms.txt
Allow scope-specific llms.txt files at path prefixes:
```
/llms.txt          → Full site overview + links to scoped variants
/news/llms.txt     → News content only
/docs/llms.txt     → Documentation only
/products/llms.txt → Product catalog only
/blog/llms.txt     → Blog posts only
```
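On the serving side, each scope is just another dynamic endpoint reusing the header helper from earlier. The sketch below only shows the shape; `get_published_docs` and `generate_docs_content` are hypothetical placeholders for a docs pipeline, not functions from our codebase:

```python
@app.get("/docs/llms.txt", response_class=PlainTextResponse)
async def docs_llms_txt():
    # Hypothetical docs pipeline; only the structure matters here.
    docs = await get_published_docs()
    content = generate_docs_content(docs)
    # Docs change rarely, so this scope advertises a weekly update frequency,
    # while /news/llms.txt keeps frequency="realtime" as shown earlier.
    return PlainTextResponse(content, headers=generate_adp_headers(content, frequency="weekly"))
```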
Reference Implementation
Root `/llms.txt` (links to variants):

```
---
version: 1.0
lastModified: 2026-01-04T10:00:00Z
hasVariants: true
---

# Pressonify.ai

> AI-powered press release platform

## Scoped Variants

For focused content, see our scope-specific llms.txt files:

- [/news/llms.txt](/news/llms.txt) - Press releases only (realtime)
- [/blog/llms.txt](/blog/llms.txt) - Blog posts (daily)
- [/docs/llms.txt](/docs/llms.txt) - Documentation (weekly)

## Full Site Overview

[General platform description...]
```
Scoped `/news/llms.txt`:

```
---
version: 1.0
lastModified: 2026-01-04T10:00:00Z
scope: news
parent: /llms.txt
---

# Pressonify.ai News

> 247 press releases optimized for AI citation

[News-specific content only...]
```
Backward Compatibility
This extension is fully backward compatible:
- Existing crawlers that only look for `/llms.txt` still work
- The root file can be static (traditional) or dynamic
- Scoped variants are optional, not required by the spec
- New crawlers can discover variants via the `hasVariants` field (see the discovery sketch below)
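To illustrate how a new crawler might use `hasVariants`, here is a sketch that fetches the root file, reads the frontmatter, and collects links to scoped variants. It uses `httpx` and `PyYAML` for brevity; a real crawler would also honor ETags and robots directives.

```python
import re

import httpx
import yaml


def discover_variants(base_url: str) -> list[str]:
    """Fetch /llms.txt and, if hasVariants is set, return the scoped variant paths."""
    body = httpx.get(f"{base_url}/llms.txt", timeout=10).text

    # Parse the YAML frontmatter between the leading "---" fences.
    match = re.match(r"^---\n(.*?)\n---", body, re.DOTALL)
    frontmatter = yaml.safe_load(match.group(1)) if match else {}
    if not frontmatter.get("hasVariants"):
        return []

    # Collect every markdown link that points at a scoped llms.txt file.
    return re.findall(r"\]\((/[\w-]+/llms\.txt)\)", body)


# e.g. discover_variants("https://pressonify.ai") -> ["/news/llms.txt", "/blog/llms.txt", ...]
```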
Benefits
| Benefit | Description |
|---|---|
| Reduced bloat | Each file focuses on one content type |
| Better relevance | AI systems get scoped context for specific queries |
| Efficient crawling | Crawlers can target specific scopes they care about |
| Independent update frequencies | News can be realtime while docs are weekly |
Testing & Validation
How do we verify this actually works?
1. Crawler Logging
We log every AI crawler hit:
```python
from datetime import datetime

from fastapi import Request

AI_CRAWLER_PATTERNS = [
    "GPTBot", "PerplexityBot", "Claude-Web",
    "Anthropic", "Google-Extended", "Bingbot",
    "CCBot", "ChatGPT-User", "Bytespider",
    "Amazonbot", "AppleBot"
]


async def log_crawler_hit(request: Request, endpoint: str):
    """Fire-and-forget crawler logging."""
    # `supabase` is assumed to be a module-level async client created at startup.
    user_agent = request.headers.get("User-Agent", "")
    for pattern in AI_CRAWLER_PATTERNS:
        if pattern.lower() in user_agent.lower():
            await supabase.table("adp_crawler_hits").insert({
                "endpoint": endpoint,
                "crawler": pattern,
                "timestamp": datetime.utcnow().isoformat()
            }).execute()
            break
```
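The docstring says fire-and-forget, but awaiting the insert inside the request handler would add the database round-trip to the response time. FastAPI's `BackgroundTasks` runs the logger after the response is sent; here is a sketch of how the handler can schedule it:

```python
from fastapi import BackgroundTasks, Request


@app.get("/news/llms.txt", response_class=PlainTextResponse)
async def news_llms_txt(request: Request, background_tasks: BackgroundTasks):
    prs = await get_published_prs(limit=50)
    content = generate_llms_content(prs)
    headers = generate_adp_headers(content, frequency="realtime")

    # Logging happens after the response goes out, off the critical path.
    background_tasks.add_task(log_crawler_hit, request, "/news/llms.txt")

    return PlainTextResponse(content, headers=headers)
```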
2. Public Stats API
We expose crawler statistics publicly:
```
GET /api/v1/adp/stats
```

```json
{
  "total_hits": 1247,
  "by_endpoint": {
    "/llms.txt": 423,
    "/news/llms.txt": 312,
    "/ai-discovery.json": 289
  },
  "by_crawler": {
    "GPTBot": 456,
    "PerplexityBot": 389,
    "Claude-Web": 201
  }
}
```
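The aggregation behind that response isn't shown in this post. At our volume (a few thousand logged hits) it's reasonable to pull the rows and count them in Python; a minimal sketch using the same `adp_crawler_hits` table and client helper as above:

```python
from collections import Counter


@app.get("/api/v1/adp/stats")
async def adp_stats():
    supabase = get_supabase_client()
    rows = supabase.table("adp_crawler_hits").select("endpoint, crawler").execute().data or []

    # FastAPI serializes the dict to JSON automatically.
    return {
        "total_hits": len(rows),
        "by_endpoint": dict(Counter(row["endpoint"] for row in rows)),
        "by_crawler": dict(Counter(row["crawler"] for row in rows)),
    }
```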
3. Expected Crawl Patterns
After deploying dynamic llms.txt, we observed:
| Crawler | Before | After | Change |
|---|---|---|---|
| PerplexityBot | 2/week | 4/day | +12x |
| GPTBot | 1/week | 2/day | +14x |
| Claude-Web | 3/week | 1/day | +2x |
The increased crawl frequency suggests AI systems recognize the real-time nature of our content.
Results & Learnings
What Worked
- Real-time metadata → Crawlers respect the `X-Update-Frequency` header
- ETag support → 40% of subsequent requests use `If-None-Match`
- Scoped endpoints → PerplexityBot specifically hits `/news/llms.txt`
What Didn't
- Complex filtering → We built `?category=` filtering, but no crawlers use it yet
- Extended YAML fields → Custom fields like `generator` are ignored by current AI systems
Future Enhancements
- Company-specific feeds: `/company/{name}/llms.txt`
- Time-windowed exports: `/news/llms.txt?since=2026-01-01`
- Format negotiation: Return JSON-LD for bots that prefer it (see the sketch below)
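Format negotiation could branch on the `Accept` header and reuse the same data path. This is only a sketch of the idea, not something we've shipped; `generate_json_ld` is a hypothetical helper:

```python
from fastapi import Request
from fastapi.responses import JSONResponse


@app.get("/news/llms.txt")
async def news_llms_txt_negotiated(request: Request):
    prs = await get_published_prs(limit=50)

    # Bots that explicitly ask for JSON-LD get structured data instead of Markdown.
    if "application/ld+json" in request.headers.get("Accept", ""):
        payload = generate_json_ld(prs)  # hypothetical helper, not in our codebase yet
        return JSONResponse(payload, media_type="application/ld+json")

    content = generate_llms_content(prs)
    return PlainTextResponse(content, headers=generate_adp_headers(content, frequency="realtime"))
```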
Contributing to the Spec
We're proposing the scope-specific variant pattern to the llms.txt community.
How to Provide Feedback
- GitHub Discussions: llmstxt.org discussions
- Email: [email protected]
- Twitter/X: @pressonify
Our Commitment
We're committed to:
- Sharing our learnings publicly
- Contributing back to the llms.txt spec
- Maintaining backward compatibility
- Open-sourcing reusable components
Full Code Reference
All code from this post is available in our implementation:
- Endpoint: `main.py` lines 5140-5200
- Header generation: `app/utils/adp_headers.py`
- Crawler logging: `app/middleware/crawler_logger.py`
For questions or collaboration, reach out at [email protected].
Resources
- llms.txt Specification
- RFC 7232 - Conditional Requests (ETag standard)
- RFC 9530 - Digest Fields (Content-Digest)
- Part 1: Beyond Static (business context)
- AI Discovery Protocol v2.1 (our broader ADP)
This is Part 2 of a 2-part series on Dynamic llms.txt. Part 1 covers the business rationale and innovation assessment.