Dynamic llms.txt: Technical Implementation & Spec Proposal
TL;DR
This is the technical companion to Part 1. Here you'll find:
- Complete FastAPI code for dynamic llms.txt
- ADP header generation patterns
- Database query optimization strategies
- A proposed spec extension for scope-specific llms.txt files
All code is production-tested on Pressonify.ai.
Architecture Overview
Before diving into code, here's the high-level flow:
```
┌──────────────────┐      ┌──────────────────┐      ┌──────────────────┐
│   AI Crawler     │─────▶│     FastAPI      │─────▶│     Supabase     │
│  (Perplexity,    │      │     Endpoint     │      │     Database     │
│   ChatGPT)       │      │                  │      │                  │
└──────────────────┘      └──────────────────┘      └──────────────────┘
         │                         │                         │
         ▼                         ▼                         ▼
  Request with              Query published            Return press
  If-None-Match             press releases             releases as
  header (ETag)             (limit, fields)            list of dicts
         │                         │                         │
         └─────────────────────────┴─────────────────────────┘
                                   │
                                   ▼
                      ┌───────────────────────┐
                      │   Generate Response   │
                      │  - YAML frontmatter   │
                      │  - Markdown content   │
                      │  - ADP headers        │
                      └───────────────────────┘
```
Why Computed Over Cached?
We chose to compute content at request time rather than cache it because:
- Freshness > Latency: A few extra milliseconds is worth always-accurate content
- Simplicity: No cache invalidation logic to maintain
- Database is fast: Supabase queries return in <50ms for our use case
- Headers handle efficiency: ETag support means crawlers skip redundant downloads
The FastAPI Implementation
Here's the core endpoint structure:
```python
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
from datetime import datetime
import hashlib
import base64

app = FastAPI()


@app.get("/news/llms.txt", response_class=PlainTextResponse)
async def news_llms_txt():
    """
    Dynamic llms.txt endpoint for news content.

    Generates at request time from database with:
    - Computed YAML frontmatter
    - Latest press releases
    - ADP-compliant HTTP headers
    """
    # 1. Fetch content from database
    prs = await get_published_prs(limit=50)

    # 2. Generate llms.txt content
    content = generate_llms_content(prs)

    # 3. Generate ADP headers
    headers = generate_adp_headers(content, frequency="realtime")

    return PlainTextResponse(content, headers=headers)
```
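The handler above always sends a full body. To make the ETag pay off, conditional requests need to be answered with `304 Not Modified` when the validator matches. The variant below is a minimal sketch of that logic; our production handler may differ in the details:

```python
from fastapi import Request
from fastapi.responses import Response


@app.get("/news/llms.txt", response_class=PlainTextResponse)
async def news_llms_txt(request: Request):
    prs = await get_published_prs(limit=50)
    content = generate_llms_content(prs)
    headers = generate_adp_headers(content, frequency="realtime")

    # If the crawler's cached ETag still matches, send 304 and skip the body.
    if request.headers.get("If-None-Match") == headers["ETag"]:
        return Response(status_code=304, headers=headers)

    return PlainTextResponse(content, headers=headers)
```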
The Content Generator
```python
def generate_llms_content(prs: list) -> str:
    """
    Generate llms.txt content from press release data.
    """
    now = datetime.utcnow().isoformat() + "Z"

    # YAML frontmatter with computed fields
    content = f"""---
version: 2.9.5
lastModified: {now}
totalArticles: {len(prs)}
scope: news-content-only
updateFrequency: realtime
protocol: AI Discovery Protocol v2.1
---

# Pressonify.ai News Feed

> {len(prs)} press releases optimized for AI citation

## Latest Press Releases

"""

    # Add each press release
    for pr in prs:
        content += f"""### {pr['headline']}

- **Company**: {pr['company_name']}
- **Category**: {pr['category']}
- **Published**: {pr['published_at']}
- **URL**: https://pressonify.ai/news/{pr['slug']}-{pr['id']}
- **Summary**: {pr['summary'][:200]}...

"""

    # Add available feeds section
    content += """## Available Feeds

For real-time updates, subscribe to our feeds:

- **RSS**: https://pressonify.ai/rss
- **JSON Feed**: https://pressonify.ai/feed.json
- **Delta Updates**: https://pressonify.ai/updates.json
- **Bulk Archive**: https://pressonify.ai/news/archive.jsonl

## About This Endpoint

This `/news/llms.txt` endpoint is dynamically generated from our database
at request time. Unlike static llms.txt files, every request returns the
current state of our news feed.

See also: https://pressonify.ai/llms.txt (full site context)
"""

    return content
```
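To sanity-check the generator locally, you can feed it a single record. The values below are made-up sample data, used only for illustration:

```python
sample_prs = [{
    "id": 123,  # hypothetical sample record
    "slug": "acme-launches-widget",
    "headline": "Acme Launches Widget",
    "summary": "Acme today announced the Widget, a tool for doing things faster.",
    "company_name": "Acme Inc.",
    "category": "Product Launch",
    "published_at": "2026-01-04T09:00:00Z",
}]

# Prints the YAML frontmatter followed by a single "Latest Press Releases" entry.
print(generate_llms_content(sample_prs))
```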
ADP Header Generation
Headers are critical for efficient crawling. Here's our generator:
```python
import hashlib
import base64
from typing import Literal


def generate_adp_headers(
    content: str,
    frequency: Literal["realtime", "hourly", "daily", "weekly"] = "daily"
) -> dict:
    """
    Generate AI Discovery Protocol compliant HTTP headers.

    Args:
        content: The response body content
        frequency: Update frequency hint for crawlers

    Returns:
        Dict of HTTP headers
    """
    # Content-based hashes
    content_bytes = content.encode('utf-8')
    sha256_hash = hashlib.sha256(content_bytes).digest()
    md5_hash = hashlib.md5(content_bytes).hexdigest()

    # Cache durations by frequency
    cache_durations = {
        "realtime": 300,     # 5 minutes
        "hourly": 3600,      # 1 hour
        "daily": 86400,      # 24 hours
        "weekly": 604800     # 7 days
    }

    return {
        # Cache validation
        "ETag": f'W/"{md5_hash}"',

        # Content integrity (RFC 9530)
        "Content-Digest": f"sha-256=:{base64.b64encode(sha256_hash).decode()}:",

        # Crawler scheduling hint
        "X-Update-Frequency": frequency,

        # Browser/CDN caching
        "Cache-Control": f"public, max-age={cache_durations[frequency]}",

        # CORS for browser-based AI tools
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Expose-Headers": "ETag, Content-Digest, X-Update-Frequency",

        # Content type
        "Content-Type": "text/plain; charset=utf-8"
    }
```
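Calling it on any body shows the shape of the result. Identical content always produces the same ETag and Content-Digest, which is exactly what makes conditional requests work:

```python
headers = generate_adp_headers("sample body", frequency="realtime")

for name in ("ETag", "Content-Digest", "X-Update-Frequency", "Cache-Control"):
    print(f"{name}: {headers[name]}")

# Same input, same hashes: a crawler can safely reuse the ETag it saw last time.
assert headers["ETag"] == generate_adp_headers("sample body", frequency="realtime")["ETag"]
```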
Header Breakdown
| Header | RFC | Purpose |
|---|---|---|
| `ETag` | RFC 7232 | Weak validator for cache freshness. Crawlers send `If-None-Match` to skip unchanged content. |
| `Content-Digest` | RFC 9530 | SHA-256 hash of body for integrity verification. |
| `X-Update-Frequency` | Custom | Hints how often crawlers should return. |
| `Cache-Control` | RFC 7234 | Browser and CDN caching directives. |
| `Access-Control-*` | CORS | Enables browser-based AI tools to fetch content. |
Database Query Optimization
For real-time generation, query efficiency matters:
```python
async def get_published_prs(
    limit: int = 50,
    since: str = None,
    category: str = None
) -> list:
    """
    Fetch published press releases optimized for llms.txt.

    Only fetches fields needed for the llms.txt format.
    """
    supabase = get_supabase_client()

    # Start query with minimal field selection
    query = supabase.table("press_releases").select(
        "id, slug, headline, summary, company_name, category, published_at"
    ).eq(
        "status", "published"
    ).order(
        "published_at", desc=True
    ).limit(limit)

    # Optional filters
    if since:
        query = query.gte("published_at", since)
    if category:
        query = query.eq("category", category)

    result = query.execute()
    return result.data or []
```
Optimization Strategies
- Select only needed fields: Don't fetch `body` if you only need `summary`
- Limit results: 50 items is usually enough for llms.txt context
- Index columns: Ensure `status` and `published_at` are indexed
- Connection pooling: Reuse database connections across requests (see the sketch below)
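The `get_supabase_client()` helper isn't shown in this post. One simple way to get connection reuse is to memoize the client so every request shares it. This is a minimal sketch, assuming the standard `supabase-py` package and `SUPABASE_URL` / `SUPABASE_KEY` environment variables:

```python
import os
from functools import lru_cache

from supabase import Client, create_client


@lru_cache(maxsize=1)
def get_supabase_client() -> Client:
    """Build the Supabase client once and reuse it for every request."""
    return create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])
```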
Query Performance
On our production database with ~1,000 press releases:
| Query | Time |
|---|---|
| Full fetch (all fields) | ~120ms |
| Optimized fetch (6 fields) | ~35ms |
| With category filter | ~28ms |
The optimized query is 3-4x faster than the naive full fetch.
YAML Frontmatter Generation
The frontmatter is the "metadata about metadata" that tells AI systems about the document itself:
```python
def generate_yaml_frontmatter(
    prs: list,
    scope: str = "news-content-only"
) -> str:
    """
    Generate YAML frontmatter with computed fields.
    """
    now = datetime.utcnow()

    # Calculate update frequency based on publication rate
    recent_count = sum(
        1 for pr in prs
        if (now - parse_date(pr['published_at'])).days < 1
    )

    if recent_count > 5:
        frequency = "realtime"
    elif recent_count > 0:
        frequency = "hourly"
    else:
        frequency = "daily"

    return f"""---
version: 2.9.5
lastModified: {now.isoformat()}Z
totalArticles: {len(prs)}
scope: {scope}
updateFrequency: {frequency}
protocol: AI Discovery Protocol v2.1
generator: Pressonify.ai Dynamic llms.txt v1.0
---"""
```
Computed Fields Explained
| Field | Type | Purpose |
|---|---|---|
| `version` | Static | API/content version for compatibility |
| `lastModified` | Computed | Exact generation timestamp |
| `totalArticles` | Computed | Database count for context |
| `scope` | Scoped | What content this file covers |
| `updateFrequency` | Computed | Based on recent publication rate |
| `generator` | Static | Identifies the generating system |
Proposed Spec Extension: Scope-Specific llms.txt
Based on our experience, we're proposing an extension to the llms.txt specification.
Problem Statement
Multi-purpose websites have different content types that require different context for AI systems:
- News sites: Articles, breaking news, archives
- E-commerce: Products, categories, reviews
- SaaS platforms: Documentation, blog, changelog
A single /llms.txt file becomes bloated trying to cover everything, or too shallow to be useful for any specific use case.
Proposed Solution: /[scope]/llms.txt
Allow scope-specific llms.txt files at path prefixes:
```
/llms.txt          → Full site overview + links to scoped variants
/news/llms.txt     → News content only
/docs/llms.txt     → Documentation only
/products/llms.txt → Product catalog only
/blog/llms.txt     → Blog posts only
```
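On the serving side, each scope is just another dynamic endpoint reusing the header helper from earlier. The sketch below only shows the shape; `get_published_docs` and `generate_docs_content` are hypothetical placeholders for a docs pipeline, not functions from our codebase:

```python
@app.get("/docs/llms.txt", response_class=PlainTextResponse)
async def docs_llms_txt():
    # Hypothetical docs pipeline; only the structure matters here.
    docs = await get_published_docs()
    content = generate_docs_content(docs)
    # Docs change rarely, so this scope advertises a weekly update frequency,
    # while /news/llms.txt keeps frequency="realtime" as shown earlier.
    return PlainTextResponse(content, headers=generate_adp_headers(content, frequency="weekly"))
```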
Reference Implementation
Root `/llms.txt` (links to variants):

```
---
version: 1.0
lastModified: 2026-01-04T10:00:00Z
hasVariants: true
---

# Pressonify.ai

> AI-powered press release platform

## Scoped Variants

For focused content, see our scope-specific llms.txt files:

- [/news/llms.txt](/news/llms.txt) - Press releases only (realtime)
- [/blog/llms.txt](/blog/llms.txt) - Blog posts (daily)
- [/docs/llms.txt](/docs/llms.txt) - Documentation (weekly)

## Full Site Overview

[General platform description...]
```
Scoped `/news/llms.txt`:

```
---
version: 1.0
lastModified: 2026-01-04T10:00:00Z
scope: news
parent: /llms.txt
---

# Pressonify.ai News

> 247 press releases optimized for AI citation

[News-specific content only...]
```
Backward Compatibility
This extension is fully backward compatible:
- Existing crawlers that only look for `/llms.txt` still work
- The root file can be static (traditional) or dynamic
- Scoped variants are optional, not required by the spec
- New crawlers can discover variants via the `hasVariants` field (see the discovery sketch below)
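To illustrate how a new crawler might use `hasVariants`, here is a sketch that fetches the root file, reads the frontmatter, and collects links to scoped variants. It uses `httpx` and `PyYAML` for brevity; a real crawler would also honor ETags and robots directives.

```python
import re

import httpx
import yaml


def discover_variants(base_url: str) -> list[str]:
    """Fetch /llms.txt and, if hasVariants is set, return the scoped variant paths."""
    body = httpx.get(f"{base_url}/llms.txt", timeout=10).text

    # Parse the YAML frontmatter between the leading "---" fences.
    match = re.match(r"^---\n(.*?)\n---", body, re.DOTALL)
    frontmatter = yaml.safe_load(match.group(1)) if match else {}
    if not frontmatter.get("hasVariants"):
        return []

    # Collect every markdown link that points at a scoped llms.txt file.
    return re.findall(r"\]\((/[\w-]+/llms\.txt)\)", body)


# e.g. discover_variants("https://pressonify.ai") -> ["/news/llms.txt", "/blog/llms.txt", ...]
```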
Benefits
| Benefit | Description |
|---|---|
| Reduced bloat | Each file focuses on one content type |
| Better relevance | AI systems get scoped context for specific queries |
| Efficient crawling | Crawlers can target specific scopes they care about |
| Independent update frequencies | News can be realtime while docs are weekly |
Testing & Validation
How do we verify this actually works?
1. Crawler Logging
We log every AI crawler hit:
```python
from datetime import datetime

from fastapi import Request

AI_CRAWLER_PATTERNS = [
    "GPTBot", "PerplexityBot", "Claude-Web",
    "Anthropic", "Google-Extended", "Bingbot",
    "CCBot", "ChatGPT-User", "Bytespider",
    "Amazonbot", "AppleBot"
]


async def log_crawler_hit(request: Request, endpoint: str):
    """Fire-and-forget crawler logging."""
    # `supabase` is assumed to be a module-level async client created at startup.
    user_agent = request.headers.get("User-Agent", "")
    for pattern in AI_CRAWLER_PATTERNS:
        if pattern.lower() in user_agent.lower():
            await supabase.table("adp_crawler_hits").insert({
                "endpoint": endpoint,
                "crawler": pattern,
                "timestamp": datetime.utcnow().isoformat()
            }).execute()
            break
```
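The docstring says fire-and-forget, but awaiting the insert inside the request handler would add the database round-trip to the response time. FastAPI's `BackgroundTasks` runs the logger after the response is sent; here is a sketch of how the handler can schedule it:

```python
from fastapi import BackgroundTasks, Request


@app.get("/news/llms.txt", response_class=PlainTextResponse)
async def news_llms_txt(request: Request, background_tasks: BackgroundTasks):
    prs = await get_published_prs(limit=50)
    content = generate_llms_content(prs)
    headers = generate_adp_headers(content, frequency="realtime")

    # Logging happens after the response goes out, off the critical path.
    background_tasks.add_task(log_crawler_hit, request, "/news/llms.txt")

    return PlainTextResponse(content, headers=headers)
```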
2. Public Stats API
We expose crawler statistics publicly:
```
GET /api/v1/adp/stats
```

```json
{
  "total_hits": 1247,
  "by_endpoint": {
    "/llms.txt": 423,
    "/news/llms.txt": 312,
    "/ai-discovery.json": 289
  },
  "by_crawler": {
    "GPTBot": 456,
    "PerplexityBot": 389,
    "Claude-Web": 201
  }
}
```
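The aggregation behind that response isn't shown in this post. At our volume (a few thousand logged hits) it's reasonable to pull the rows and count them in Python; a minimal sketch using the same `adp_crawler_hits` table and client helper as above:

```python
from collections import Counter


@app.get("/api/v1/adp/stats")
async def adp_stats():
    supabase = get_supabase_client()
    rows = supabase.table("adp_crawler_hits").select("endpoint, crawler").execute().data or []

    # FastAPI serializes the dict to JSON automatically.
    return {
        "total_hits": len(rows),
        "by_endpoint": dict(Counter(row["endpoint"] for row in rows)),
        "by_crawler": dict(Counter(row["crawler"] for row in rows)),
    }
```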
3. Expected Crawl Patterns
After deploying dynamic llms.txt, we observed:
| Crawler | Before | After | Change |
|---|---|---|---|
| PerplexityBot | 2/week | 4/day | +12x |
| GPTBot | 1/week | 2/day | +14x |
| Claude-Web | 3/week | 1/day | +2x |
The increased crawl frequency suggests AI systems recognize the real-time nature of our content.
Results & Learnings
What Worked
- Real-time metadata → Crawlers respect the `X-Update-Frequency` header
- ETag support → 40% of subsequent requests use `If-None-Match`
- Scoped endpoints → PerplexityBot specifically hits `/news/llms.txt`
What Didn't
- Complex filtering → We built `?category=` filtering, but no crawlers use it yet
- Extended YAML fields → Custom fields like `generator` are ignored by current AI systems
Future Enhancements
- Company-specific feeds: `/company/{name}/llms.txt`
- Time-windowed exports: `/news/llms.txt?since=2026-01-01`
- Format negotiation: Return JSON-LD for bots that prefer it (see the sketch below)
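Format negotiation could branch on the `Accept` header and reuse the same data path. This is only a sketch of the idea, not something we've shipped; `generate_json_ld` is a hypothetical helper:

```python
from fastapi import Request
from fastapi.responses import JSONResponse


@app.get("/news/llms.txt")
async def news_llms_txt_negotiated(request: Request):
    prs = await get_published_prs(limit=50)

    # Bots that explicitly ask for JSON-LD get structured data instead of Markdown.
    if "application/ld+json" in request.headers.get("Accept", ""):
        payload = generate_json_ld(prs)  # hypothetical helper, not in our codebase yet
        return JSONResponse(payload, media_type="application/ld+json")

    content = generate_llms_content(prs)
    return PlainTextResponse(content, headers=generate_adp_headers(content, frequency="realtime"))
```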
Contributing to the Spec
We're proposing the scope-specific variant pattern to the llms.txt community.
How to Provide Feedback
- GitHub Discussions: llmstxt.org discussions
- Email: [email protected]
- Twitter/X: @pressonify
Our Commitment
We're committed to:
- Sharing our learnings publicly
- Contributing back to the llms.txt spec
- Maintaining backward compatibility
- Open-sourcing reusable components
Full Code Reference
All code from this post is available in our implementation:
- Endpoint: `main.py` lines 5140-5200
- Header generation: `app/utils/adp_headers.py`
- Crawler logging: `app/middleware/crawler_logger.py`
For questions or collaboration, reach out at [email protected].
Resources
- llms.txt Specification
- RFC 7232 - Conditional Requests (ETag standard)
- RFC 9530 - Digest Fields (Content-Digest)
- Part 1: Beyond Static (business context)
- AI Discovery Protocol v2.1 (our broader ADP)
This is Part 2 of a 2-part series on Dynamic llms.txt. Part 1 covers the business rationale and innovation assessment.