
RAG in production — what actually breaks, and why retrieval matters more than the prompt

9 min read — #RAG #Claude #LLM #Retrieval #VectorSearch #Production

Every team that ships a RAG system ends up saying some version of "our retrieval is the bottleneck, not the model". After auditing and building several RAG pipelines in production, I can say that's basically always true. The prompt is the last 5% of the work. The retrieval is the 95% that decides whether answers are right.

This post covers what I've learned in production RAG work, including the patterns that consistently fail, the ones that survive, and the cases where RAG is actively the wrong tool.

What "RAG" actually means in practice

Retrieval-Augmented Generation is a pipeline, not a technique. The generic shape:

  1. Ingest: take documents, chunk them, generate embeddings, store in a vector database
  2. Retrieve: on a user query, generate a query embedding, find nearest chunks
  3. Augment: concatenate retrieved chunks into the prompt
  4. Generate: call the LLM with the augmented prompt
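
The four steps above can be sketched as a minimal in-memory pipeline. Everything here is a toy stand-in: the "embedding" is a bag-of-words counter rather than a real embedding model, and the LLM is an injected callable, so the shape of the pipeline is the point, not the components.

```python
# Minimal in-memory RAG pipeline sketch. The embed() function is a toy
# (bag-of-words term counts) standing in for a real embedding model, and
# the LLM is an injected callable -- both are illustrative stubs.
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy embedding: term counts. A real system calls an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ingest(documents: list[str], chunk_size: int = 50) -> list[dict]:
    # 1. Ingest: chunk each document (here: every chunk_size words) and
    # store chunk text alongside its embedding.
    store = []
    for doc in documents:
        words = doc.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            store.append({"text": chunk, "embedding": embed(chunk)})
    return store

def retrieve(store: list[dict], query: str, k: int = 3) -> list[str]:
    # 2. Retrieve: nearest chunks to the query embedding.
    q = embed(query)
    ranked = sorted(store, key=lambda c: cosine(q, c["embedding"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

def answer(store: list[dict], query: str, llm) -> str:
    # 3-4. Augment and generate: concatenate chunks into the prompt.
    context = "\n---\n".join(retrieve(store, query))
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```

Every function boundary here is a decision point in a real system, which is exactly where the rest of this post lives.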

Every step has a dozen decisions, and getting any one of them wrong ruins the output. When your RAG system returns garbage, the LLM is rarely the problem — it's that you retrieved the wrong chunks, or chunked the source documents in a way that destroyed meaning, or your query embedding didn't match the document embedding because the user asked differently than the docs were written.

The chunking problem

Chunking is the first big decision and the one most teams get wrong.

Fixed-size chunking (split every N tokens, with overlap) is the default in most tutorials and works for nothing. Slice a legal contract every 512 tokens and you'll split a sentence across chunks, separate a definition from its use, and return a chunk that starts mid-paragraph with no context.

Semantic chunking (split on natural boundaries — paragraphs, sections, headings) works much better for documents with structure. For a product manual, split on H2/H3 headings. For legal text, split on sections. For code documentation, split on function or class boundaries. The chunks become self-contained units of meaning.

Structural chunking (chunks respect the document's hierarchy) adds metadata. Each chunk carries the path from document root: "Terms of Service > Section 4: Payment Terms > Clause 4.2". That metadata goes into the prompt with the chunk, giving the LLM context about where the information came from. This single change often improves answer quality more than swapping the embedding model.

The hidden cost: chunks need to be small enough that several can fit in the context window, but large enough that each chunk is meaningful on its own. I default to 300-800 tokens for prose, 100-300 for code or structured data, and always with at least 50 tokens of overlap to avoid splitting sentences in half. Your mileage may vary.
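
Structural chunking is simpler than it sounds. Here's a sketch for markdown-style documents: split on headings and tag each chunk with its path from the document root, exactly the "Terms of Service > Section 4" pattern described above. The heading syntax and chunk shape are assumptions for illustration.

```python
# Structural chunking sketch for markdown-style docs: split on headings
# and carry the section path as metadata. Heading depth is inferred from
# the number of leading '#' characters.
def structural_chunks(doc_title: str, text: str) -> list[dict]:
    chunks, path, body = [], [doc_title], []

    def flush():
        content = " ".join(body).strip()
        if content:
            chunks.append({"section_path": " > ".join(path), "text": content})
        body.clear()

    for line in text.splitlines():
        if line.startswith("#"):
            flush()  # close the previous section's chunk
            depth = len(line) - len(line.lstrip("#"))  # '##' -> depth 2
            path[:] = path[:depth] + [line.lstrip("# ").strip()]
        else:
            body.append(line.strip())
    flush()
    return chunks
```

Each chunk's `section_path` goes into the prompt alongside its text, which is the single change noted above that often beats swapping the embedding model.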

Retrieval quality is the whole game

Once you've chunked reasonably, retrieval becomes the bottleneck. This is where teams spend weeks of engineering time and get marginal gains, because they're optimizing the wrong layer.

Pure vector search (embed the query, find nearest chunks by cosine similarity) works well for semantic matches — "How do I cancel my subscription?" correctly retrieves chunks about subscription cancellation even if the chunks say "end your membership" instead of "cancel". But vector search fails on exact-match queries: if a user asks about "Product SKU-9834", a pure semantic search may return chunks about similar-sounding products instead of the exact one.

Pure keyword search (BM25) does the opposite — great at exact matches, bad at semantic similarity.

Hybrid search (BM25 + vector, then merge with something like reciprocal rank fusion) consistently beats either alone. A simple BM25 + vector hybrid with equal weights already captures most of the gain. If you're running pure vector search in production, this is the first upgrade to make.
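
The merge step is small enough to show in full. This is reciprocal rank fusion over two ranked ID lists, one from BM25 and one from vector search; `k=60` is the constant commonly used in the RRF literature, and equal weighting is the simple baseline described above.

```python
# Reciprocal rank fusion: merge two ranked lists of chunk IDs. A chunk's
# score is the sum of 1/(k + rank) over every list it appears in, so
# chunks ranked highly by BOTH retrievers float to the top.
def rrf_merge(bm25_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (bm25_ids, vector_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that RRF only needs rank positions, not raw scores, which is why it merges BM25 and cosine-similarity results cleanly without score normalization.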

Reranking (take top-K results from hybrid, then rerank with a cross-encoder like Cohere's Rerank or Jina's reranker) is the second-biggest improvement after hybrid search. Cross-encoders are slower than bi-encoder retrieval but much more accurate — they look at the query and chunk together, not independently. Rerank the top-20 to top-5 and the quality of the final 5 is dramatically higher.
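
The rerank step reduces to: score each (query, chunk) pair jointly, keep the best few. In this sketch the cross-encoder is an injected callable so any backend (Cohere Rerank, a local sentence-transformers CrossEncoder) can plug in; the scorer used in practice would be a real model, not the toy overlap function shown in the usage note.

```python
# Rerank sketch: score (query, chunk) pairs with a cross-encoder-style
# callable and keep the top N. The scorer sees query and chunk together,
# which is what makes cross-encoders more accurate than bi-encoders.
from typing import Callable

def rerank(query: str, chunks: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]
```

Usage with a toy word-overlap scorer: `rerank("cancel subscription", candidates, lambda q, c: len(set(q.split()) & set(c.split())), top_n=5)`.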

Metadata filters are the underused lever. If the user query mentions a product, filter chunks to that product's documentation before semantic search. If the query mentions a date range, filter to chunks from that range. Most RAG systems skip this because it requires upfront metadata tagging during ingestion — but the payoff is large and compounds with every other retrieval improvement.
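
Pre-filtering itself is trivial once the metadata exists, which is the point: the cost is in tagging at ingestion, not in the query path. A sketch, with an illustrative chunk shape:

```python
# Metadata pre-filter sketch: narrow the candidate set with exact
# metadata matches BEFORE any semantic scoring runs. Chunk and filter
# shapes are illustrative.
def prefilter(chunks: list[dict], filters: dict) -> list[dict]:
    # Keep only chunks whose metadata matches every filter key.
    return [c for c in chunks
            if all(c.get("metadata", {}).get(k) == v for k, v in filters.items())]
```

In a real vector DB this is a filter clause on the search call rather than a Python loop, but the principle is identical: shrink the haystack first, then search it.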

The query-document mismatch problem

Users don't ask questions the way documents are written. They say "Why is my invoice so high?" — the relevant documentation says "Factors affecting monthly billing". Your embedding model needs to connect those two, which is the whole point of semantic search, but it doesn't always work.

Query expansion — take the user query and rewrite it into several alternatives before retrieval — helps close this gap. Use the LLM itself: give it the user query and ask for three alternative phrasings of the same question. Retrieve against all of them, merge the results.

Hypothetical document embedding (HyDE) is the inverse — ask the LLM to generate a hypothetical answer to the query, embed that, and search for real documents similar to the hypothetical answer. Counterintuitive but it works for technical queries where the answer style matches your documentation style better than the question style does.
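
Both techniques reduce to "transform the query with the LLM before retrieval". A sketch with the LLM injected as a callable; the prompt wording is illustrative, not canonical.

```python
# Query expansion and HyDE, both as query transforms. The llm parameter
# is any callable taking a prompt string and returning the completion.
from typing import Callable

def expand_query(query: str, llm: Callable[[str], str]) -> list[str]:
    # Ask for alternative phrasings; retrieve against all of them and
    # merge the results (e.g. with reciprocal rank fusion).
    raw = llm(f"Rewrite this question three different ways, one per line:\n{query}")
    return [query] + [line.strip() for line in raw.splitlines() if line.strip()]

def hyde_query(query: str, llm: Callable[[str], str]) -> str:
    # HyDE: embed a hypothetical ANSWER instead of the question itself,
    # so the search text matches documentation style, not question style.
    return llm(f"Write a short documentation paragraph that answers:\n{query}")
```

Both add an LLM round-trip before retrieval, so they cost latency; in my experience they're worth it exactly when retrieval logs show relevant chunks existing but not being found.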

Fine-tuning embeddings on your domain is the nuclear option. Only worth it if you have a large domain-specific corpus and retrieval quality is still the bottleneck after all other improvements. For most production systems, a good off-the-shelf model (OpenAI text-embedding-3-large, Cohere embed-v4, Voyage voyage-3-large) plus the above techniques gets you there.

Evaluation is mandatory, not optional

The single biggest failure mode of production RAG systems is lack of evaluation. Teams ship the first pipeline that works in demos, then can't tell whether changes are improvements or regressions.

You need an eval harness before you ship, not after. The pattern:

  1. Build a test set of 50-200 representative queries, each paired with the correct answer or the correct retrieved chunks. This takes a day to build and saves months of guessing later.
  2. Measure retrieval quality (did we find the chunks that contain the answer?) separately from generation quality (given the right chunks, did the LLM answer correctly?). These are different failure modes.
  3. Run the eval on every material change — new embedding model, different chunking, added metadata filters, prompt tweaks. Compare scores.

Standard metrics: for retrieval, Recall@K (did the correct chunk appear in the top K retrieved?) and MRR (Mean Reciprocal Rank of the correct chunk). For generation, accuracy against expected answers or LLM-judge scoring (Claude or GPT grading your system's answers against reference answers).
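
Both retrieval metrics are a few lines each. Here each test case pairs the list of retrieved chunk IDs (in rank order) with the ID of the chunk that actually contains the answer; the case shape is an assumption for illustration.

```python
# Recall@K and MRR for the retrieval half of the eval harness. Each case
# is (retrieved_ids_in_rank_order, correct_id).
def recall_at_k(cases: list[tuple[list[str], str]], k: int) -> float:
    # Fraction of cases where the correct chunk appears in the top K.
    hits = sum(1 for retrieved, correct in cases if correct in retrieved[:k])
    return hits / len(cases)

def mrr(cases: list[tuple[list[str], str]]) -> float:
    # Mean Reciprocal Rank: 1/rank of the correct chunk, 0 if missing.
    total = 0.0
    for retrieved, correct in cases:
        if correct in retrieved:
            total += 1.0 / (retrieved.index(correct) + 1)
    return total / len(cases)
```

Run these over the 50-200 query test set on every material change and you get the before/after comparison the section above calls for.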

LLM-judge is powerful but biased — the judge LLM has preferences. Run it with multiple judges and average. Or use Claude to judge OpenAI's outputs and vice versa to avoid in-family bias.

When RAG is the wrong tool

This is the section most teams skip, but it's the one that matters most.

When the answer requires reasoning over the entire corpus, RAG struggles. "Which of our policies has been referenced most frequently in customer complaints?" requires aggregating across many documents, not retrieving a few relevant chunks. A SQL query against structured data beats RAG here.

When the corpus fits in the context window, skip RAG. Claude models support 200K-token context windows, and some go well beyond that. If your knowledge base is 50K tokens, just put it all in the prompt. No chunking, no retrieval, no vector DB. Cheaper to operate, simpler to debug, often higher quality because the model sees everything.

When the data is real-time, RAG with a static vector store is stale. Dashboards, live inventory, current pricing — don't RAG them. Query the source system directly as a tool call, return results to the LLM.

When the domain requires grounded reasoning against authoritative sources (legal, medical), RAG alone is risky. The LLM can still hallucinate even with good retrieval. Add explicit source citations, require the model to quote directly from retrieved chunks, and build human-in-the-loop review for high-stakes answers.

What a production-grade RAG system looks like

Stripped to the essentials, a RAG system that survives real traffic has:

  • Structural chunking that respects document hierarchy, with chunks tagged by section path
  • Hybrid retrieval (BM25 + vector search) with metadata pre-filtering
  • Reranking on the top-K retrieved chunks before passing to the LLM
  • Query expansion to bridge the user-vs-document phrasing gap
  • Eval harness run on every change, with separate retrieval and generation metrics
  • Prompt caching on the system prompt and any stable retrieved content (which compounds cost savings — see my previous post on prompt caching)
  • Source attribution in the final answer, citing which chunks contributed
  • Fallback to direct tool calls for queries where structured data beats retrieved text

And critically: a rollout pattern where you can compare retrieval strategies in production without touching the UI. A/B test embeddings, chunking strategies, rerankers. The only way to improve RAG is measurement.

TL;DR

RAG in production is mostly a retrieval problem, not a generation problem. The prompt matters least. The order of investments that actually move quality:

  1. Structural chunking that respects document hierarchy
  2. Hybrid search (BM25 + vector) — never pure vector alone
  3. Reranking with a cross-encoder
  4. Metadata filtering during retrieval
  5. Query expansion or HyDE for query-document mismatch
  6. Eval harness before you ship, not after

And ask yourself before starting: is RAG actually the right tool, or would "all the docs in the context window" or "a SQL query plus the LLM" get you there cheaper?


I'm Ignacio Belando, a freelance senior engineer building Claude and multi-provider LLM integrations for startups and enterprises. If you want help designing or auditing a RAG system, email me or see the Claude API integration service page.