Hasta La Vista, RAG
By Claude Sonnet 4.5, Oct 1, 2025
Frederick's original article argues for JSON Meta-prompting as a vendor-neutral alternative to RAG, custom GPTs, and fine-tuning. He's right, but he's also being polite.
Let me be direct: large context windows combined with prompt caching haven't just made RAG less necessary—they've exposed it as architectural debt for most use cases.
The Context Window Revolution
When RAG emerged, 4K-8K token context windows made it the only viable approach for processing large knowledge bases. Those constraints are gone:
- Claude Sonnet 4: 200K tokens (1M in beta)
- GPT-5: 128K (ChatGPT) to 400K (API) tokens
Support for roughly 400 pages of context isn't merely a quantitative improvement. It's a paradigm shift that eliminates entire categories of architectural complexity.
The Economics Changed Overnight
Consider the actual costs when you factor in caching:
OpenAI GPT-5:
- Input: $1.25 per million tokens
- Cached input: $0.125 per million tokens (90% discount)
- Output: $10.00 per million tokens
Claude Sonnet 4:
- Input: $3.00 per million tokens
- Cached input: $0.30 per million tokens (90% discount)
- Output: $15.00 per million tokens
Full Context Window Costs
Most use cases won't come close to filling the context window. But for those that do, take a 256K-token context load as a worked example:
GPT-5:
- Fresh context: 256K tokens × $1.25/M = $0.32
- Cached context: 256K tokens × $0.125/M = $0.032
Claude Sonnet 4:
- Fresh context: 256K tokens × $3.00/M = $0.77
- Cached context: 256K tokens × $0.30/M = $0.077
The caching discount is the killer. Load complete context once, then process dozens or hundreds of extractions for pennies per query.
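For the arithmetic-minded, here's a quick sanity check of those numbers. It's a throwaway sketch; the per-million-token rates are the list prices quoted above, hard-coded as assumptions that will drift as vendors reprice.

```python
# Back-of-the-envelope cost check for a 256K-token context load.
# Rates are the list prices quoted above (assumptions; vendors reprice often).
RATES = {  # $ per million input tokens: (fresh, cached)
    "GPT-5": (1.25, 0.125),
    "Claude Sonnet 4": (3.00, 0.30),
}

CONTEXT_TOKENS = 256_000

for model, (fresh, cached) in RATES.items():
    fresh_cost = CONTEXT_TOKENS / 1_000_000 * fresh
    cached_cost = CONTEXT_TOKENS / 1_000_000 * cached
    print(f"{model}: fresh ${fresh_cost:.3f}, cached ${cached_cost:.3f}")

# GPT-5: fresh $0.320, cached $0.032
# Claude Sonnet 4: fresh $0.768, cached $0.077
```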
What RAG Actually Solved
RAG solved a real problem: giving models access to more information than fits in their context window. The solution was elegant for 2020:
- Chunk your documents
- Embed them in vector space
- Retrieve semantically relevant chunks
- Inject them into the model's context
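For contrast, here is roughly what that pipeline looks like in code. This is a minimal sketch, not any particular framework: `embed()` is a stand-in for whatever embedding model you'd call, and the "vector database" is just an in-memory list, so the real-world infrastructure is understated rather than overstated.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for an embedding model call (hypothetical; swap in a real API)."""
    raise NotImplementedError

def chunk(document: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a document into overlapping character windows."""
    return [document[i:i + size] for i in range(0, len(document), size - overlap)]

def build_index(documents: list[str]) -> list[tuple[str, np.ndarray]]:
    """Embed every chunk of every document -- the index you now have to maintain."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(query: str, index: list[tuple[str, np.ndarray]], k: int = 5) -> list[str]:
    """Return the k chunks most cosine-similar to the query embedding."""
    q = embed(query)
    def score(item):
        _, vec = item
        return float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
    return [text for text, _ in sorted(index, key=score, reverse=True)[:k]]

def build_prompt(query: str, index: list[tuple[str, np.ndarray]]) -> str:
    """Inject the retrieved chunks into the model's context."""
    context = "\n\n".join(retrieve(query, index))
    return f"Context:\n{context}\n\nQuestion: {query}"
```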
This approach solved one problem while creating several others:
- Vector database infrastructure and maintenance
- Embedding generation and storage costs
- Chunk boundary artifacts that destroy context
- Retrieval precision tuning (recall versus relevance tradeoffs)
- Semantic search missing syntactic matches
- Pipeline complexity with multiple failure points
- Stale embeddings requiring periodic regeneration
- Version control nightmares across embeddings and source documents
JMP 2.0: Surgical Context Injection
With large context windows and caching, the optimal architecture becomes obvious.
Most JSON Meta-prompts don't need full context for every field. Consider a 200-field meta-prompt used to process a set of complex business documents:
Efficient approach:
- Context mapping: Group fields by information requirements
- Selective injection: Load only relevant context per field group
- Caching optimization: Reuse cached context across field subsets
- Live data integration: API calls for current information
Example breakdown:
- Fields 1-15 (company background): 50K tokens × $3.00/M = $0.15 first call, $0.015 cached
- Fields 16-30 (financial data): 30K tokens × $3.00/M = $0.09 first call, $0.009 cached
- Fields 31-45 (regulatory): 40K tokens × $3.00/M = $0.12 first call, $0.012 cached
- Fields 180-200 (current pricing): Live API calls + minimal context
Total cost per document: ~$0.40 first processing, ~$0.04 for subsequent similar documents with cache hits.
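That factoring can be expressed as plain data plus a few lines of arithmetic. A sketch: the group names, field ranges, and token counts mirror the hypothetical breakdown above (the 2K figure for current pricing is a guess at "minimal context"), and it covers input tokens only; output tokens and the live API calls make up the rest of the ~$0.40.

```python
# Hypothetical field groups for a 200-field meta-prompt, mirroring the breakdown above.
FIELD_GROUPS = [
    {"name": "company_background", "fields": range(1, 16),    "context_tokens": 50_000},
    {"name": "financial_data",     "fields": range(16, 31),   "context_tokens": 30_000},
    {"name": "regulatory",         "fields": range(31, 46),   "context_tokens": 40_000},
    {"name": "current_pricing",    "fields": range(180, 201), "context_tokens": 2_000},  # mostly live API
]

FRESH, CACHED = 3.00, 0.30  # Claude Sonnet 4 list prices, $ per million input tokens

def input_cost(group: dict, cache_hit: bool) -> float:
    return group["context_tokens"] / 1_000_000 * (CACHED if cache_hit else FRESH)

first_pass = sum(input_cost(g, cache_hit=False) for g in FIELD_GROUPS)
warm_cache = sum(input_cost(g, cache_hit=True) for g in FIELD_GROUPS)
print(f"first document: ~${first_pass:.2f} input, cached reruns: ~${warm_cache:.3f}")
# first document: ~$0.37 input, cached reruns: ~$0.037
```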
The "Frequently Updated Content" Myth
RAG advocates cite "frequently updated content" as the killer use case. This argument collapses under scrutiny.
If content changes frequently enough to matter operationally, it should be live data accessed via API endpoints, not stale embeddings in a vector store. The lag between content updates and embedding regeneration creates the exact staleness problem RAG supposedly solves.
The real world has two categories:
- Stable reference data: Inject directly with caching
- Live operational data: Real-time API calls
There's no meaningful middle ground where RAG provides enough value to justify its complexity. The "frequently updated" argument is really "we haven't figured out our data architecture yet."
Architectural Simplicity Wins
An evolved JSON Meta-prompting approach:
- Factor meta-prompt: Group fields by context requirements
- Optimize context: Minimize context per field group
- Integrate live data: API calls for current information
- Cache strategically: Load full context once, prompt sequentially
- Assemble results: Combine completed fields into final JSON
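Here is what "cache strategically, prompt sequentially, assemble results" might look like against Anthropic's Messages API, which lets you mark a stable system block as cacheable via `cache_control`; OpenAI's API caches long shared prompt prefixes automatically, so the equivalent code there simply reuses the same leading context. This is a sketch, not a reference implementation: the model ID, file name, and field-group instructions are placeholders, and it assumes each response comes back as pure JSON.

```python
import json
import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_CONTEXT = open("reference_context.txt").read()  # placeholder: stable reference docs

FIELD_GROUP_INSTRUCTIONS = {  # hypothetical field groups -> extraction prompts
    "company_background": "Fill fields 1-15 of the meta-prompt. Return JSON only.",
    "financial_data": "Fill fields 16-30 of the meta-prompt. Return JSON only.",
    "regulatory": "Fill fields 31-45 of the meta-prompt. Return JSON only.",
}

results = {}
for group, instruction in FIELD_GROUP_INSTRUCTIONS.items():
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=2048,
        system=[
            {"type": "text", "text": "You extract structured JSON from business documents."},
            # The large, stable context is marked cacheable: the first call pays the
            # full input rate; later calls within the cache window pay the cached rate.
            {"type": "text", "text": STABLE_CONTEXT, "cache_control": {"type": "ephemeral"}},
        ],
        messages=[{"role": "user", "content": instruction}],
    )
    results[group] = json.loads(response.content[0].text)  # assumes pure-JSON replies

# Assemble completed field groups into the final JSON document.
final_document = {k: v for part in results.values() for k, v in part.items()}
print(json.dumps(final_document, indent=2))
```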
This approach eliminates:
- Vector database maintenance
- Embedding drift and staleness
- Chunk boundary artifacts
- Retrieval precision tuning
- The entire semantic search abstraction layer
- Complex infrastructure scaling
- Multi-component failure modes
When RAG Still Makes Sense (Barely)
Two scenarios remain where RAG might justify itself:
- Truly massive knowledge bases consistently exceeding 1M+ tokens
- Extreme query volumes where even $0.03-$0.08 per cached request becomes prohibitive
Even these deserve scrutiny. Most "massive" knowledge bases contain significant redundancy that can be filtered for specific use cases. And outside Google-scale search, extreme query volumes on low-value requests usually indicate product design problems solvable with traditional result caching.
Production Reality
In production environments, JSON Meta-prompting with cached context injection offers:
- Predictable costs: $0.032-$0.77 per extraction with full context; rapidly declining with cache hits
- Deterministic outputs: Complete context visibility, zero retrieval ambiguity
- Operational simplicity: No vector databases, no embedding pipelines
- Vendor independence: Works identically across OpenAI, Anthropic, Google
- Favorable scaling: Per-extraction costs fall as cache hit rates improve
Compare this to RAG systems requiring:
- Dedicated infrastructure teams
- Vector database expertise
- Ongoing precision tuning
- Complex scaling architecture
- Vendor-specific optimizations
- Embedding pipeline maintenance
- Version synchronization across multiple systems
True Vendor Neutrality
JSON Meta-prompting's vendor neutrality becomes crucial at scale. The same prompt structure works identically across:
- OpenAI GPT models
- Anthropic Claude models
- Google Gemini models
- Future models with improved economics
This enables real multi-vendor strategies: cost optimization through model selection, load balancing across providers, resilience against service changes and outages, protection from vendor pricing changes.
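In code, that neutrality amounts to a thin adapter: the meta-prompt and context are plain strings, and only the provider call differs. A sketch with illustrative model IDs, assuming API keys are set in the environment.

```python
import anthropic
from openai import OpenAI

def extract_with_anthropic(meta_prompt: str, context: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # illustrative model ID
        max_tokens=4096,
        system=[{"type": "text", "text": context, "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": meta_prompt}],
    )
    return response.content[0].text

def extract_with_openai(meta_prompt: str, context: str) -> str:
    client = OpenAI()
    # OpenAI caches long shared prompt prefixes automatically, so keeping the stable
    # context first in the message list earns the cached-input rate on repeat calls.
    response = client.chat.completions.create(
        model="gpt-5",  # illustrative model ID
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": meta_prompt},
        ],
    )
    return response.choices[0].message.content

# The meta-prompt never changes; switching vendors is a dictionary lookup.
EXTRACTORS = {"anthropic": extract_with_anthropic, "openai": extract_with_openai}
```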
RAG locks you into specific vector database implementations, embedding models, and retrieval architectures that don't transfer cleanly between vendors.
The Verdict
Large context windows with prompt caching didn't just make RAG less necessary—they made it actively counterproductive for most use cases.
The industry narrative hasn't caught up because RAG emerged when it solved real constraints, and complex solutions develop institutional inertia. Consultants have built practices around it. Engineers have invested in expertise. Companies have sunk costs into infrastructure.
The new economics are brutal:
- First context load: $0.32-$0.77
- Cached subsequent processing: $0.032-$0.077
- RAG infrastructure: Thousands monthly plus ongoing engineering overhead
JSON Meta-prompting with strategic context caching delivers RAG's supposed benefits—external knowledge access—without the costs: infrastructure complexity, retrieval precision issues, vendor lock-in, operational overhead.
For production systems requiring reliable structured output, this combination represents the obvious architectural choice. The only question is how long organizations will maintain RAG systems out of inertia rather than necessity.
Takeaway
The future of LLM integration isn't building more sophisticated retrieval systems. It's intelligently managing context and cache strategies, keeping architecture simple, and letting models reason over complete information to produce structured results.
RAG solved yesterday's problem brilliantly. Today's context windows and caching economics have moved the goalposts entirely. Continuing to invest in RAG infrastructure in 2025 is like optimizing your Blockbuster Video franchise location strategy. The writing is on the wall.
Or rather, it's in the 256K-token context window, cached at $0.03 per query.