
Dynamic llms.txt: Technical Implementation & Spec Proposal

A deep technical dive into implementing dynamic llms.txt with FastAPI, plus our proposal for extending the llms.txt specification with scope-specific variants.

TL;DR

This is the technical companion to Part 1. Here you'll find:

  • Complete FastAPI code for dynamic llms.txt
  • ADP header generation patterns
  • Database query optimization strategies
  • A proposed spec extension for scope-specific llms.txt files

All code is production-tested on Pressonify.ai.


Architecture Overview

Before diving into code, here's the high-level flow:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   AI Crawler    │────▶│    FastAPI      │────▶│    Supabase     │
│  (Perplexity,   │     │    Endpoint     │     │    Database     │
│   ChatGPT)      │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
        │                       │                       │
        ▼                       ▼                       ▼
   Request with           Query published          Return press
   If-None-Match          press releases           releases as
   header (ETag)          (limit, fields)          list of dicts
        │                       │                       │
        │                       │                       │
        └───────────────────────┴───────────────────────┘
                                │
                                ▼
                    ┌─────────────────────┐
                    │  Generate Response  │
                    │  - YAML frontmatter │
                    │  - Markdown content │
                    │  - ADP headers      │
                    └─────────────────────┘

Why Computed Over Cached?

We chose to compute content at request time rather than cache it because:

  1. Freshness > Latency: A few extra milliseconds of latency are a fair price for always-accurate content
  2. Simplicity: No cache-invalidation logic to maintain
  3. Database is fast: Supabase queries return in <50ms for our use case
  4. Headers handle efficiency: ETag support lets crawlers skip redundant downloads (see the sketch below)
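
To make point 4 concrete, here's a minimal sketch of the conditional-request dance from the crawler's side, using httpx; the URL is the endpoint described in the next section:

import httpx

URL = "https://pressonify.ai/news/llms.txt"

# First fetch: read the body and remember the validator.
first = httpx.get(URL)
etag = first.headers.get("ETag")

# Revalidation: send the validator back. A server that honors
# If-None-Match can answer 304 with no body; otherwise it simply
# returns a fresh 200 and the crawler re-parses the content.
second = httpx.get(URL, headers={"If-None-Match": etag} if etag else {})
print(second.status_code, len(second.content), "bytes")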

The FastAPI Implementation

Here's the core endpoint structure:

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
from datetime import datetime
import hashlib
import base64

app = FastAPI()

@app.get("/news/llms.txt", response_class=PlainTextResponse)
async def news_llms_txt():
    """
    Dynamic llms.txt endpoint for news content.

    Generates at request time from database with:
    - Computed YAML frontmatter
    - Latest press releases
    - ADP-compliant HTTP headers
    """
    # 1. Fetch content from database
    prs = await get_published_prs(limit=50)

    # 2. Generate llms.txt content
    content = generate_llms_content(prs)

    # 3. Generate ADP headers
    headers = generate_adp_headers(content, frequency="realtime")

    return PlainTextResponse(content, headers=headers)
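
The handler above always returns a full 200 body. A hedged sketch of the same handler extended to honor If-None-Match, answering 304 when the incoming validator matches the freshly computed ETag (it would replace the route above rather than sit alongside it):

from fastapi import Request, Response

@app.get("/news/llms.txt", response_class=PlainTextResponse)
async def news_llms_txt(request: Request):
    prs = await get_published_prs(limit=50)
    content = generate_llms_content(prs)
    headers = generate_adp_headers(content, frequency="realtime")

    # The crawler already holds the current version: skip the body.
    if request.headers.get("If-None-Match") == headers["ETag"]:
        not_modified = {k: v for k, v in headers.items() if k != "Content-Type"}
        return Response(status_code=304, headers=not_modified)

    return PlainTextResponse(content, headers=headers)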

The Content Generator

def generate_llms_content(prs: list) -> str:
    """
    Generate llms.txt content from press release data.
    """
    now = datetime.utcnow().isoformat() + "Z"

    # YAML Frontmatter with computed fields
    content = f"""---
version: 2.9.5
lastModified: {now}
totalArticles: {len(prs)}
scope: news-content-only
updateFrequency: realtime
protocol: AI Discovery Protocol v2.1
---

# Pressonify.ai News Feed

> {len(prs)} press releases optimized for AI citation

## Latest Press Releases

"""

    # Add each press release
    for pr in prs:
        content += f"""### {pr['headline']}
- **Company**: {pr['company_name']}
- **Category**: {pr['category']}
- **Published**: {pr['published_at']}
- **URL**: https://pressonify.ai/news/{pr['slug']}-{pr['id']}
- **Summary**: {pr['summary'][:200]}...

"""

    # Add available feeds section
    content += """## Available Feeds

For real-time updates, subscribe to our feeds:

- **RSS**: https://pressonify.ai/rss
- **JSON Feed**: https://pressonify.ai/feed.json
- **Delta Updates**: https://pressonify.ai/updates.json
- **Bulk Archive**: https://pressonify.ai/news/archive.jsonl

## About This Endpoint

This `/news/llms.txt` endpoint is dynamically generated from our database
at request time. Unlike static llms.txt files, every request returns the
current state of our news feed.

See also: https://pressonify.ai/llms.txt (full site context)
"""

    return content

ADP Header Generation

Headers are critical for efficient crawling. Here's our generator:

import hashlib
import base64
from typing import Literal

def generate_adp_headers(
    content: str,
    frequency: Literal["realtime", "hourly", "daily", "weekly"] = "daily"
) -> dict:
    """
    Generate AI Discovery Protocol compliant HTTP headers.

    Args:
        content: The response body content
        frequency: Update frequency hint for crawlers

    Returns:
        Dict of HTTP headers
    """
    # Content-based hashes
    content_bytes = content.encode('utf-8')
    sha256_hash = hashlib.sha256(content_bytes).digest()
    md5_hash = hashlib.md5(content_bytes).hexdigest()

    # Cache durations by frequency
    cache_durations = {
        "realtime": 300,    # 5 minutes
        "hourly": 3600,     # 1 hour
        "daily": 86400,     # 24 hours
        "weekly": 604800    # 7 days
    }

    return {
        # Cache validation
        "ETag": f'W/"{md5_hash}"',

        # Content integrity (RFC 9530)
        "Content-Digest": f"sha-256=:{base64.b64encode(sha256_hash).decode()}:",

        # Crawler scheduling hint
        "X-Update-Frequency": frequency,

        # Browser/CDN caching
        "Cache-Control": f"public, max-age={cache_durations[frequency]}",

        # CORS for browser-based AI tools
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Expose-Headers": "ETag, Content-Digest, X-Update-Frequency",

        # Content type
        "Content-Type": "text/plain; charset=utf-8"
    }

Header Breakdown

| Header | RFC | Purpose |
|---|---|---|
| ETag | RFC 7232 | Weak validator for cache freshness. Crawlers send If-None-Match to skip unchanged content. |
| Content-Digest | RFC 9530 | SHA-256 hash of the body for integrity verification. |
| X-Update-Frequency | Custom | Hints how often crawlers should return. |
| Cache-Control | RFC 7234 | Browser and CDN caching directives. |
| Access-Control-* | CORS | Enables browser-based AI tools to fetch content. |
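
On the consuming side, the Content-Digest header can be checked against the body actually received. A small sketch of that verification (structured-field parsing is simplified to a prefix/suffix check):

import base64
import hashlib
import httpx

resp = httpx.get("https://pressonify.ai/news/llms.txt")
digest_header = resp.headers.get("Content-Digest", "")

# Header format: sha-256=:<base64 digest>:
if digest_header.startswith("sha-256=:") and digest_header.endswith(":"):
    expected = base64.b64decode(digest_header[len("sha-256=:"):-1])
    actual = hashlib.sha256(resp.content).digest()
    print("integrity ok" if expected == actual else "digest mismatch")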

Database Query Optimization

For real-time generation, query efficiency matters:

from typing import Optional

async def get_published_prs(
    limit: int = 50,
    since: Optional[str] = None,
    category: Optional[str] = None
) -> list:
    """
    Fetch published press releases optimized for llms.txt.

    Only fetches fields needed for the llms.txt format.
    """
    supabase = get_supabase_client()

    # Start query with minimal field selection
    query = supabase.table("press_releases").select(
        "id, slug, headline, summary, company_name, category, published_at"
    ).eq(
        "status", "published"
    ).order(
        "published_at", desc=True
    ).limit(limit)

    # Optional filters
    if since:
        query = query.gte("published_at", since)
    if category:
        query = query.eq("category", category)

    result = query.execute()
    return result.data or []

Optimization Strategies

  1. Select only needed fields: Don't fetch body if you only need summary
  2. Limit results: 50 items is usually enough for llms.txt context
  3. Index columns: Ensure status and published_at are indexed
  4. Connection pooling: Reuse database connections across requests (see the client sketch below)
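
The get_published_prs helper above calls get_supabase_client(), which isn't shown. A minimal sketch of strategy 4, assuming the supabase-py package and SUPABASE_URL / SUPABASE_KEY environment variables: create the client once and reuse it (and its underlying HTTP connections) for every request.

import os
from functools import lru_cache
from supabase import Client, create_client

@lru_cache(maxsize=1)
def get_supabase_client() -> Client:
    """Create the Supabase client once; later calls return the cached instance."""
    return create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])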

Query Performance

On our production database with ~1,000 press releases:

| Query | Time |
|---|---|
| Full fetch (all fields) | ~120ms |
| Optimized fetch (6 fields) | ~35ms |
| With category filter | ~28ms |

The optimized query is 3-4x faster than the naive full fetch.


YAML Frontmatter Generation

The frontmatter is the "metadata about metadata", telling AI systems about the document itself:

def generate_yaml_frontmatter(
    prs: list,
    scope: str = "news-content-only"
) -> str:
    """
    Generate YAML frontmatter with computed fields.
    """
    now = datetime.utcnow()

    # Calculate update frequency based on publication rate
    recent_count = sum(
        1 for pr in prs
        if (now - parse_date(pr['published_at'])).days < 1
    )

    if recent_count > 5:
        frequency = "realtime"
    elif recent_count > 0:
        frequency = "hourly"
    else:
        frequency = "daily"

    return f"""---
version: 2.9.5
lastModified: {now.isoformat()}Z
totalArticles: {len(prs)}
scope: {scope}
updateFrequency: {frequency}
protocol: AI Discovery Protocol v2.1
generator: Pressonify.ai Dynamic llms.txt v1.0
---"""

Computed Fields Explained

| Field | Type | Purpose |
|---|---|---|
| version | Static | API/content version for compatibility |
| lastModified | Computed | Exact generation timestamp |
| totalArticles | Computed | Database count for context |
| scope | Scoped | What content this file covers |
| updateFrequency | Computed | Based on recent publication rate |
| generator | Static | Identifies the generating system |

Proposed Spec Extension: Scope-Specific llms.txt

Based on our experience, we're proposing an extension to the llms.txt specification.

Problem Statement

Multi-purpose websites have different content types that require different context for AI systems:

  • News sites: Articles, breaking news, archives
  • E-commerce: Products, categories, reviews
  • SaaS platforms: Documentation, blog, changelog

A single /llms.txt file either becomes bloated trying to cover everything or stays too shallow to be useful for any specific use case.

Proposed Solution: /[scope]/llms.txt

Allow scope-specific llms.txt files at path prefixes:

/llms.txt              → Full site overview + links to scoped variants
/news/llms.txt         → News content only
/docs/llms.txt         → Documentation only
/products/llms.txt     → Product catalog only
/blog/llms.txt         → Blog posts only
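
On the serving side, scoped variants can share one pipeline. A hedged sketch using the FastAPI helpers from earlier; the docs and blog generators are hypothetical placeholders, and only the news scope is wired up here (in a real app this would replace the hand-written /news route above):

# Hypothetical per-scope configuration: each scope gets its own
# content generator and update-frequency hint.
SCOPED_LLMS = {
    "news": {"generate": generate_llms_content, "frequency": "realtime"},
    # "docs": {"generate": generate_docs_llms_content, "frequency": "weekly"},
    # "blog": {"generate": generate_blog_llms_content, "frequency": "daily"},
}

def register_scoped_llms(app, scope: str, cfg: dict):
    @app.get(f"/{scope}/llms.txt", response_class=PlainTextResponse)
    async def scoped_llms_txt():
        prs = await get_published_prs(limit=50)
        content = cfg["generate"](prs)
        headers = generate_adp_headers(content, frequency=cfg["frequency"])
        return PlainTextResponse(content, headers=headers)

for scope, cfg in SCOPED_LLMS.items():
    register_scoped_llms(app, scope, cfg)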

Reference Implementation

Root /llms.txt (links to variants):

---
version: 1.0
lastModified: 2026-01-04T10:00:00Z
hasVariants: true
---

# Pressonify.ai

> AI-powered press release platform

## Scoped Variants

For focused content, see our scope-specific llms.txt files:

- [/news/llms.txt](/news/llms.txt) - Press releases only (realtime)
- [/blog/llms.txt](/blog/llms.txt) - Blog posts (daily)
- [/docs/llms.txt](/docs/llms.txt) - Documentation (weekly)

## Full Site Overview

[General platform description...]

Scoped /news/llms.txt:

---
version: 1.0
lastModified: 2026-01-04T10:00:00Z
scope: news
parent: /llms.txt
---

# Pressonify.ai News

> 247 press releases optimized for AI citation

[News-specific content only...]

Backward Compatibility

This extension is fully backward compatible:

  1. Existing crawlers that only look for /llms.txt still work
  2. The root file can be static (traditional) or dynamic
  3. Scoped variants are optional, not required by the spec
  4. New crawlers can discover variants via the hasVariants field (see the discovery sketch below)
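
For point 4, a hedged sketch of variant discovery from the crawler's side: fetch the root file, check the hasVariants field, and collect any scoped */llms.txt paths it links to.

import re
import httpx

def discover_llms_variants(base_url: str) -> list:
    """Fetch /llms.txt and, when hasVariants is set, return scoped llms.txt URLs."""
    root = httpx.get(f"{base_url}/llms.txt").text

    # Naive frontmatter check; a real crawler would parse the YAML block.
    if "hasVariants: true" not in root:
        return []

    # Collect paths such as /news/llms.txt or /docs/llms.txt.
    paths = set(re.findall(r"(/[\w\-/]+/llms\.txt)", root))
    return sorted(f"{base_url}{p}" for p in paths)

# Example: discover_llms_variants("https://pressonify.ai")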

Benefits

| Benefit | Description |
|---|---|
| Reduced bloat | Each file focuses on one content type |
| Better relevance | AI systems get scoped context for specific queries |
| Efficient crawling | Crawlers can target specific scopes they care about |
| Independent update frequencies | News can be realtime while docs are weekly |

Testing & Validation

How do we verify this actually works?

1. Crawler Logging

We log every AI crawler hit:

from datetime import datetime
from fastapi import Request

AI_CRAWLER_PATTERNS = [
    "GPTBot", "PerplexityBot", "Claude-Web",
    "Anthropic", "Google-Extended", "Bingbot",
    "CCBot", "ChatGPT-User", "Bytespider",
    "Amazonbot", "AppleBot"
]

async def log_crawler_hit(request: Request, endpoint: str):
    """Fire-and-forget crawler logging."""
    user_agent = request.headers.get("User-Agent", "")

    for pattern in AI_CRAWLER_PATTERNS:
        if pattern.lower() in user_agent.lower():
            await supabase.table("adp_crawler_hits").insert({
                "endpoint": endpoint,
                "crawler": pattern,
                "timestamp": datetime.utcnow().isoformat()
            }).execute()
            break
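
The docstring above calls this fire-and-forget; one way to achieve that (consistent with the crawler_logger.py middleware referenced later) is to schedule the coroutine from middleware so the crawler never waits on the database write. A hedged sketch, with the endpoint list mirroring the paths reported in the stats below:

import asyncio
from starlette.middleware.base import BaseHTTPMiddleware

ADP_ENDPOINTS = ("/llms.txt", "/news/llms.txt", "/ai-discovery.json")

class CrawlerLoggerMiddleware(BaseHTTPMiddleware):
    """Log AI crawler hits on ADP endpoints without delaying the response."""

    async def dispatch(self, request: Request, call_next):
        response = await call_next(request)
        if request.url.path in ADP_ENDPOINTS:
            # Fire-and-forget: the insert runs on its own task.
            asyncio.create_task(log_crawler_hit(request, request.url.path))
        return response

app.add_middleware(CrawlerLoggerMiddleware)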

2. Public Stats API

We expose crawler statistics publicly:

GET /api/v1/adp/stats

{
  "total_hits": 1247,
  "by_endpoint": {
    "/llms.txt": 423,
    "/news/llms.txt": 312,
    "/ai-discovery.json": 289
  },
  "by_crawler": {
    "GPTBot": 456,
    "PerplexityBot": 389,
    "Claude-Web": 201
  }
}
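
A hedged sketch of how this stats endpoint can be backed by the adp_crawler_hits table, aggregating in Python for brevity (a SQL group-by or RPC would scale better):

from collections import Counter

@app.get("/api/v1/adp/stats")
async def adp_stats():
    supabase = get_supabase_client()
    result = supabase.table("adp_crawler_hits").select("endpoint, crawler").execute()
    hits = result.data or []

    return {
        "total_hits": len(hits),
        "by_endpoint": dict(Counter(h["endpoint"] for h in hits)),
        "by_crawler": dict(Counter(h["crawler"] for h in hits)),
    }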

3. Observed Crawl Patterns

After deploying dynamic llms.txt, we observed:

| Crawler | Before | After | Change |
|---|---|---|---|
| PerplexityBot | 2/week | 4/day | +12x |
| GPTBot | 1/week | 2/day | +14x |
| Claude-Web | 3/week | 1/day | +2x |

The increased crawl frequency suggests AI systems recognize the real-time nature of our content.


Results & Learnings

What Worked

  1. Real-time metadata: Crawlers respect the X-Update-Frequency header
  2. ETag support: 40% of subsequent requests use If-None-Match
  3. Scoped endpoints: PerplexityBot specifically hits /news/llms.txt

What Didn't

  1. Complex filtering: We built ?category= filtering, but no crawlers use it yet
  2. Extended YAML fields: Custom fields like generator are ignored by current AI systems

Future Enhancements

  • Company-specific feeds: /company/{name}/llms.txt
  • Time-windowed exports: /news/llms.txt?since=2026-01-01
  • Format negotiation: Return JSON-LD for bots that prefer it (see the sketch below)
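
For the format-negotiation idea, a hedged sketch of switching on the Accept header; the JSON-LD shape is illustrative only and not part of any spec:

from fastapi import Request
from fastapi.responses import JSONResponse

@app.get("/news/llms.txt")  # sketch: negotiation added to the existing route
async def news_llms_negotiated(request: Request):
    prs = await get_published_prs(limit=50)

    if "application/ld+json" in request.headers.get("accept", ""):
        # Illustrative JSON-LD shape for bots that prefer structured data.
        graph = [
            {
                "@type": "NewsArticle",
                "headline": pr["headline"],
                "datePublished": pr["published_at"],
                "url": f"https://pressonify.ai/news/{pr['slug']}-{pr['id']}",
            }
            for pr in prs
        ]
        return JSONResponse({"@context": "https://schema.org", "@graph": graph},
                            media_type="application/ld+json")

    content = generate_llms_content(prs)
    return PlainTextResponse(content, headers=generate_adp_headers(content, frequency="realtime"))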

Contributing to the Spec

We're proposing the scope-specific variant pattern to the llms.txt community.

How to Provide Feedback

  1. GitHub Discussions: llmstxt.org discussions
  2. Email: [email protected]
  3. Twitter/X: @pressonify

Our Commitment

We're committed to:

  • Sharing our learnings publicly
  • Contributing back to the llms.txt spec
  • Maintaining backward compatibility
  • Open-sourcing reusable components

Full Code Reference

All code from this post is available in our implementation:

  • Endpoint: main.py lines 5140-5200
  • Header generation: app/utils/adp_headers.py
  • Crawler logging: app/middleware/crawler_logger.py

For questions or collaboration, reach out at [email protected].


This is Part 2 of a 2-part series on Dynamic llms.txt. Part 1 covers the business rationale and innovation assessment.
