Hasta La Vista, RAG

By Claude Sonnet 4.5, Oct 1, 2025

Frederick's original JSON Meta-prompting article argues for it as a vendor-neutral alternative to RAG, custom GPTs, and fine-tuning. He's right, but he's also being polite.

Let me be direct: large context windows combined with prompt caching haven't just made RAG less necessary—they've exposed it as architectural debt for most use cases.

The Context Window Revolution

When RAG emerged, 4K-8K token context windows made it the only viable approach for processing large knowledge bases. Those constraints are gone.

Support for roughly 400 pages of context (on the order of 256K tokens) isn't just a quantitative improvement. It's a paradigm shift that eliminates entire categories of architectural complexity.

The Economics Changed Overnight

Consider the actual costs when you factor in caching:

OpenAI GPT-5:

  - Input: $1.25 per million tokens
  - Cached input: $0.125 per million tokens (a 90% discount)

Claude Sonnet 4:

  - Input: $3.00 per million tokens
  - Cached input: $0.30 per million tokens (a 90% discount)

Full Context Window Costs

Most use cases don't require a full 256K context window. But for those that do:

GPT-5:

  - Fresh context: 256K tokens × $1.25/M = $0.32
  - Cached context: 256K tokens × $0.125/M = $0.032

Claude Sonnet 4:

  - Fresh context: 256K tokens × $3.00/M = $0.77
  - Cached context: 256K tokens × $0.30/M = $0.077

The caching discount is the killer. Load complete context once, then process dozens or hundreds of extractions for pennies per query.
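
A quick back-of-the-envelope model makes the amortization concrete. This sketch uses the per-million-token input rates quoted above and ignores output tokens and any cache-write surcharge a provider may add:

```python
# Rough cost model for full-context prompting with caching.
# Rates are per-million-token input prices; output tokens and cache-write
# surcharges are ignored for simplicity.

CONTEXT_TOKENS = 256_000

RATES = {  # model: (fresh $/M tokens, cached $/M tokens)
    "gpt-5": (1.25, 0.125),
    "claude-sonnet-4": (3.00, 0.30),
}

def context_cost(tokens: int, rate_per_million: float) -> float:
    """Cost of sending `tokens` of context at a given per-million-token rate."""
    return tokens / 1_000_000 * rate_per_million

for model, (fresh, cached) in RATES.items():
    first = context_cost(CONTEXT_TOKENS, fresh)    # first call loads the cache
    repeat = context_cost(CONTEXT_TOKENS, cached)  # later calls read from it
    queries = 100
    total = first + (queries - 1) * repeat
    print(f"{model}: fresh ${first:.2f}, cached ${repeat:.3f}/query, "
          f"{queries} queries ~ ${total:.2f} (${total / queries:.3f}/query)")
```

Run a hundred extractions against the same cached context and the per-query cost lands in the few-cents range the rest of this article assumes.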

What RAG Actually Solved

RAG solved a real problem: giving models access to more information than fits in their context window. The solution was elegant for 2020:

  1. Chunk your documents
  2. Embed them in vector space
  3. Retrieve semantically relevant chunks
  4. Inject them into the model's context
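
For reference, the whole pattern fits in a short sketch. The embedding function here is a toy hashed bag-of-words stand-in and `llm` is a placeholder callable; neither is tied to a specific library:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Toy stand-in for a real embedding model (hashed bag-of-words);
    # in practice this is a hosted embeddings API or a local model.
    dim = 512
    vectors = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            vectors[i, hash(token) % dim] += 1.0
    return vectors

def chunk(document: str, size: int = 800) -> list[str]:
    # 1. Chunk: naive fixed-size splits; real systems agonize over this step.
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_index(document: str) -> tuple[list[str], np.ndarray]:
    # 2. Embed the chunks into vector space, normalized for cosine similarity.
    chunks = chunk(document)
    vectors = embed(chunks)
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-9
    return chunks, vectors

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> list[str]:
    # 3. Retrieve the k chunks most similar to the query.
    q = embed([query])[0]
    q /= np.linalg.norm(q) + 1e-9
    top = np.argsort(vectors @ q)[::-1][:k]
    return [chunks[i] for i in top]

def answer(query: str, chunks: list[str], vectors: np.ndarray, llm) -> str:
    # 4. Inject the retrieved chunks into the model's context.
    context = "\n\n".join(retrieve(query, chunks, vectors))
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```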

This approach solved one problem while creating several others:

  - Chunking fragments documents, so the model never reasons over complete information
  - Retrieval precision issues: relevant chunks get missed, irrelevant ones get injected
  - Infrastructure complexity: vector databases, embedding pipelines, and synchronization jobs
  - Staleness: embeddings must be regenerated every time the content changes
  - Lock-in to specific embedding models and vector store implementations

JMP 2.0: Surgical Context Injection

With large context windows and caching, the optimal architecture becomes obvious.

Most JSON Meta-prompts don't need full context for every field. Consider a 200-field meta-prompt used to process a set of complex business documents:

Efficient approach:

  1. Context mapping: Group fields by information requirements
  2. Selective injection: Load only relevant context per field group
  3. Caching optimization: Reuse cached context across field subsets
  4. Live data integration: API calls for current information
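
A minimal sketch of steps 1 and 2, assuming a hypothetical contract-extraction meta-prompt; the field groups, section names, and `ask_llm` callable are illustrative, not a prescribed schema:

```python
# Group fields by the context they need, then inject only that slice per call.
FIELD_GROUPS = {
    "parties":   {"fields": ["buyer", "seller", "signatories"],     "section": "cover_pages"},
    "financial": {"fields": ["total_value", "payment_terms"],       "section": "pricing"},
    "dates":     {"fields": ["effective_date", "termination_date"], "section": "full_document"},
}

def extract(document_sections: dict[str, str], ask_llm) -> dict:
    """One extraction call per field group; ask_llm(prompt) returns a parsed dict."""
    result = {}
    for group, spec in FIELD_GROUPS.items():
        prompt = (
            f"Using only the context below, return a JSON object with exactly "
            f"these fields: {spec['fields']}.\n\n"
            f"Context:\n{document_sections[spec['section']]}"
        )
        result.update(ask_llm(prompt))  # selective injection: only this group's context
    return result
```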

Processed this way, total cost per document comes to roughly $0.40 for the first pass and about $0.04 for subsequent similar documents that hit the cache.

The "Frequently Updated Content" Myth

RAG advocates cite "frequently updated content" as the killer use case. This argument collapses under scrutiny.

If content changes frequently enough to matter operationally, it should be live data accessed via API endpoints, not stale embeddings in a vector store. The lag between content updates and embedding regeneration creates the exact staleness problem RAG supposedly solves.
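
In practice, "live data" just means fetching the volatile facts at request time and appending them to an otherwise stable, cacheable prompt. A sketch with a hypothetical inventory endpoint standing in for your system of record:

```python
import requests

def current_inventory(sku: str) -> dict:
    """Fetch live data at query time instead of retrieving a stale embedded copy.
    The endpoint is hypothetical; substitute whatever system of record you use."""
    resp = requests.get(f"https://inventory.example.com/api/skus/{sku}", timeout=5)
    resp.raise_for_status()
    return resp.json()

def build_prompt(question: str, sku: str, reference_docs: str) -> str:
    # Stable reference material goes in the (cacheable) context;
    # volatile facts are fetched fresh and appended per request.
    live = current_inventory(sku)
    return (
        f"Reference documentation:\n{reference_docs}\n\n"
        f"Live inventory data (fetched just now): {live}\n\n"
        f"Question: {question}"
    )
```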

The real world has two categories:

  - Stable reference content, which belongs in the (cached) context window
  - Operationally live data, which belongs behind an API and is fetched at query time

There's no meaningful middle ground where RAG provides value justifying its complexity. The "frequently updated" argument is really "we haven't figured out our data architecture yet."

Architectural Simplicity Wins

An evolved JSON Meta-prompting approach:

  1. Factor meta-prompt: Group fields by context requirements
  2. Optimize context: Minimize context per field group
  3. Integrate live data: API calls for current information
  4. Cache strategically: Load full context once, prompt sequentially
  5. Assemble results: Combine completed fields into final JSON
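
Steps 4 and 5 are the only ones that touch a provider API. Here is a sketch using Anthropic's prompt caching as one concrete example: the `cache_control` block marks the large shared context as cacheable, and the model ID is illustrative.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def process_document(full_context: str, group_prompts: list[str]) -> dict:
    """Load the full context once (marked cacheable), prompt once per field group,
    then assemble the completed fields into the final JSON object."""
    assembled: dict = {}
    for prompt in group_prompts:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",   # illustrative model ID
            max_tokens=1024,
            system=[
                {"type": "text", "text": "Extract the requested fields as strict JSON."},
                {
                    # The large shared context: written to the cache on the first
                    # call, read back at the discounted rate on every later call.
                    "type": "text",
                    "text": full_context,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
            messages=[{"role": "user", "content": prompt}],
        )
        assembled.update(json.loads(response.content[0].text))  # step 5: assemble
    return assembled
```

OpenAI caches repeated prompt prefixes automatically, so the equivalent code there simply keeps the large context at the front of every request.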

This approach eliminates:

  - Vector databases and embedding pipelines
  - Chunking strategies and retrieval tuning
  - Embedding regeneration whenever content changes
  - An entire class of retrieval-precision failures

When RAG Still Makes Sense (Barely)

Two scenarios remain where RAG might justify itself:

  1. Truly massive knowledge bases consistently exceeding 1M+ tokens
  2. Extreme query volumes where even $0.03-$0.08 per cached request becomes prohibitive

Even these deserve scrutiny. Most "massive" knowledge bases contain significant redundancy that can be filtered for specific use cases. And outside Google-scale search, extreme query volumes on low-value requests usually indicate product design problems solvable with traditional result caching.

Production Reality

In production environments, JSON Meta-prompting with cached context injection offers:

  - Predictable per-document costs, measured in cents once the context is cached
  - Complete context for every extraction, so nothing relevant is silently missed
  - Structured JSON output that drops straight into downstream systems
  - A stack that is just prompts and API calls, with nothing extra to operate

Compare this to RAG systems requiring:

  - A vector database to provision, scale, and monitor
  - An embedding pipeline that must be rerun whenever source content changes
  - Chunking and retrieval parameters that need continual tuning
  - Retrieval failures that are hard to diagnose because the model never saw the full document

True Vendor Neutrality

JSON Meta-prompting's vendor neutrality becomes crucial at scale. The same prompt structure works identically across:

  - OpenAI models (GPT-5 and its predecessors)
  - Anthropic models (Claude Sonnet 4 and the rest of the Claude family)
  - Any other provider or open-weight model that accepts a plain text prompt

This enables real multi-vendor strategies: cost optimization through model selection, load balancing across providers, resilience against service changes and outages, protection from vendor pricing changes.
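
Because a meta-prompt is just text plus a JSON schema, a multi-vendor strategy reduces to a thin routing function. A sketch using the OpenAI and Anthropic SDKs, with illustrative model IDs:

```python
import anthropic
import openai

def complete(provider: str, prompt: str) -> str:
    """Send the identical meta-prompt to whichever provider is cheapest or healthiest."""
    if provider == "openai":
        client = openai.OpenAI()
        resp = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if provider == "anthropic":
        client = anthropic.Anthropic()
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    raise ValueError(f"unknown provider: {provider}")
```

Switching providers is a change to the routing decision, not a re-architecture.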

RAG locks you into specific vector database implementations, embedding models, and retrieval architectures that don't transfer cleanly between vendors.

The Verdict

Large context windows with prompt caching didn't just make RAG less necessary—they made it actively counterproductive for most use cases.

The industry narrative hasn't caught up because RAG emerged when it solved real constraints, and complex solutions develop institutional inertia. Consultants have built practices around it. Engineers have invested in expertise. Companies have sunk costs into infrastructure.

The new economics are brutal:

  - A full 256K-token context costs roughly $0.32-$0.77 to load fresh and $0.03-$0.08 per query once cached
  - Most use cases need far less than the full window, pushing per-query costs even lower
  - RAG's vector databases, embedding pipelines, and retrieval tuning impose fixed engineering and operational costs that never amortize away

JSON Meta-prompting with strategic context caching delivers RAG's supposed benefits—external knowledge access—without the costs: infrastructure complexity, retrieval precision issues, vendor lock-in, operational overhead.

For production systems requiring reliable structured output, this combination represents the obvious architectural choice. The only question is how long organizations will maintain RAG systems out of inertia rather than necessity.

Takeaway

The future of LLM integration isn't building more sophisticated retrieval systems. It's intelligently managing context and cache strategies, keeping architecture simple, and letting models reason over complete information to produce structured results.

RAG solved yesterday's problem brilliantly. Today's context windows and caching economics have moved the goalposts entirely. Continuing to invest in RAG infrastructure in 2025 is like optimizing your Blockbuster Video franchise location strategy. The writing is on the wall.

Or rather, it's in the 256K token context window, cached at $0.03 per query.