<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ignacio Belando — Blog]]></title><description><![CDATA[Notes on software engineering, SEO, AI and freelance craft.]]></description><link>https://ibelando.com</link><generator>GatsbyJS</generator><lastBuildDate>Mon, 20 Apr 2026 03:15:59 GMT</lastBuildDate><item><title><![CDATA[RAG in production — what actually breaks, and why retrieval matters more than the prompt]]></title><description><![CDATA[A practical look at retrieval-augmented generation beyond the demo: chunking trade-offs, hybrid search, reranking, eval harnesses, and the cases where RAG is the wrong tool.]]></description><link>https://ibelando.com/pensieve/rag-in-production</link><guid isPermaLink="false">https://ibelando.com/pensieve/rag-in-production</guid><pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every team that ships a RAG system ends up saying some version of &quot;our retrieval is the bottleneck, not the model&quot;. After auditing and building several RAG pipelines in production, I can say that&apos;s basically always true. The prompt is the last 5% of the work. The retrieval is the 95% that decides whether answers are right.&lt;/p&gt;
&lt;p&gt;This post covers what I&apos;ve learned in production RAG work, including the patterns that consistently fail, the ones that survive, and the cases where RAG is actively the wrong tool.&lt;/p&gt;
&lt;h2&gt;What &quot;RAG&quot; actually means in practice&lt;/h2&gt;
&lt;p&gt;Retrieval-Augmented Generation is a pipeline, not a technique. The generic shape:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ingest&lt;/strong&gt;: take documents, chunk them, generate embeddings, store in a vector database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieve&lt;/strong&gt;: on a user query, generate a query embedding, find nearest chunks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Augment&lt;/strong&gt;: concatenate retrieved chunks into the prompt&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate&lt;/strong&gt;: call the LLM with the augmented prompt&lt;/li&gt;
&lt;/ol&gt;
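&lt;p&gt;The four steps can be sketched end to end with a toy in-memory index. Everything below is illustrative: &lt;code class=&quot;language-text&quot;&gt;embed&lt;/code&gt; is a stand-in bag-of-words function rather than a real embedding model, and the generate step is stubbed out:&lt;/p&gt;

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in embedding: a bag-of-words vector. A real pipeline
    # would call an embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingest: chunk documents and index their embeddings.
docs = [
    "To cancel your subscription, open Billing and choose End membership.",
    "Invoices are issued on the first day of each month.",
]
index = [(chunk, embed(chunk)) for chunk in docs]

# 2. Retrieve: nearest chunks to the query embedding.
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Augment: concatenate retrieved chunks into the prompt.
# 4. Generate: a real system would send this prompt to the LLM.
def build_prompt(query):
    context = "\n\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

&lt;p&gt;Swapping the toy pieces for a real embedding model and vector database changes the plumbing, not the shape.&lt;/p&gt;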
&lt;p&gt;Every step has a dozen decisions, and getting any one of them wrong ruins the output. When your RAG system returns garbage, the LLM is rarely the problem — it&apos;s that you retrieved the wrong chunks, or chunked the source documents in a way that destroyed meaning, or your query embedding didn&apos;t match the document embedding because the user asked differently than the docs were written.&lt;/p&gt;
&lt;h2&gt;The chunking problem&lt;/h2&gt;
&lt;p&gt;Chunking is the first big decision and the one most teams get wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fixed-size chunking&lt;/strong&gt; (split every N tokens, with overlap) is the default in most tutorials and works for nothing. Slice a legal contract every 512 tokens and you&apos;ll split a sentence across chunks, separate a definition from its use, and return a chunk that starts mid-paragraph with no context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic chunking&lt;/strong&gt; (split on natural boundaries — paragraphs, sections, headings) works much better for documents with structure. For a product manual, split on H2/H3 headings. For legal text, split on sections. For code documentation, split on function or class boundaries. The chunks become self-contained units of meaning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Structural chunking&lt;/strong&gt; (chunks respect the document&apos;s hierarchy) adds metadata. Each chunk carries the path from document root: &quot;Terms of Service &gt; Section 4: Payment Terms &gt; Clause 4.2&quot;. That metadata goes into the prompt with the chunk, giving the LLM context about where the information came from. This single change often improves answer quality more than swapping the embedding model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The hidden cost&lt;/strong&gt;: chunks need to be small enough that several can fit in the context window, but large enough that each chunk is meaningful on its own. I default to 300-800 tokens for prose, 100-300 for code or structured data, and always with at least 50 tokens of overlap to avoid splitting sentences in half. Your mileage may vary.&lt;/p&gt;
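&lt;p&gt;Heading-based structural chunking is only a few lines of code. A minimal sketch for markdown-style documents, assuming H2 headings mark section boundaries (real hierarchies need a recursive version over deeper heading levels):&lt;/p&gt;

```python
def structural_chunks(doc_title, text):
    # Split on H2 headings and tag each chunk with its section path,
    # so each chunk carries its place in the document hierarchy.
    chunks, heading, lines = [], "Introduction", []
    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"path": f"{doc_title} > {heading}", "text": body})
    for line in text.splitlines():
        if line.startswith("## "):
            flush()
            heading = line[3:].strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks
```

&lt;p&gt;Each chunk now carries a path like &quot;Terms of Service &gt; Payment Terms&quot; that goes into the prompt alongside the text.&lt;/p&gt;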
&lt;h2&gt;Retrieval quality is the whole game&lt;/h2&gt;
&lt;p&gt;Once you&apos;ve chunked reasonably, retrieval becomes the bottleneck. This is where teams spend weeks of engineering time and get marginal gains, because they&apos;re optimizing the wrong layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pure vector search&lt;/strong&gt; (embed the query, find nearest chunks by cosine similarity) works well for semantic matches — &quot;How do I cancel my subscription?&quot; correctly retrieves chunks about subscription cancellation even if the chunks say &quot;end your membership&quot; instead of &quot;cancel&quot;. But vector search fails on exact-match queries: if a user asks about &quot;Product SKU-9834&quot;, a pure semantic search may return chunks about similar-sounding products instead of the exact one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pure keyword search&lt;/strong&gt; (BM25) does the opposite — great at exact matches, bad at semantic similarity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hybrid search&lt;/strong&gt; (BM25 + vector, then merge with something like reciprocal rank fusion) consistently beats either alone. A simple BM25 + vector hybrid with equal weights already captures most of the gain. If you&apos;re running pure vector search in production, this is the first upgrade to make.&lt;/p&gt;
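&lt;p&gt;Reciprocal rank fusion itself is tiny. A sketch that merges ranked lists of chunk IDs from two retrievers, using the conventional constant k=60:&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of chunk IDs per retriever
    # (e.g. one from BM25, one from vector search).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["sku-9834", "warranty", "returns"]
vector_hits = ["returns", "sku-9834", "shipping"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;sku-9834&lt;/code&gt; ends up first because both retrievers rank it highly; a chunk only one retriever found sinks toward the bottom.&lt;/p&gt;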
&lt;p&gt;&lt;strong&gt;Reranking&lt;/strong&gt; (take top-K results from hybrid, then rerank with a cross-encoder like Cohere&apos;s Rerank or Jina&apos;s reranker) is the second-biggest improvement after hybrid search. Cross-encoders are slower than bi-encoder retrieval but much more accurate — they look at the query and chunk together, not independently. Rerank the top-20 to top-5 and the quality of the final 5 is dramatically higher.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata filters&lt;/strong&gt; are the underused lever. If the user query mentions a product, filter chunks to that product&apos;s documentation before semantic search. If the query mentions a date range, filter to chunks from that range. Most RAG systems skip this because it requires upfront metadata tagging during ingestion — but the payoff is large and compounds with every other retrieval improvement.&lt;/p&gt;
&lt;h2&gt;The query-document mismatch problem&lt;/h2&gt;
&lt;p&gt;Users don&apos;t ask questions the way documents are written. They say &quot;Why is my invoice so high?&quot; — the relevant documentation says &quot;Factors affecting monthly billing&quot;. Your embedding model needs to connect those two, which is the whole point of semantic search, but it doesn&apos;t always work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query expansion&lt;/strong&gt; — take the user query and rewrite it into several alternatives before retrieval — helps close this gap. Use the LLM itself: give it the user query and ask for three alternative phrasings of the same question. Retrieve against all of them, merge the results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hypothetical document embedding&lt;/strong&gt; (HyDE) is the inverse — ask the LLM to generate a hypothetical &lt;em&gt;answer&lt;/em&gt; to the query, embed that, and search for real documents similar to the hypothetical answer. Counterintuitive but it works for technical queries where the answer style matches your documentation style better than the question style does.&lt;/p&gt;
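&lt;p&gt;Both techniques share one skeleton: generate alternative texts, retrieve for each, merge. A sketch with a stubbed rewrite step; the variant strings are hypothetical placeholders, and in a real system the LLM would produce them:&lt;/p&gt;

```python
def expand_query(query, n=3):
    # Stub: a real system would ask the LLM for n alternative
    # phrasings of the question. These variants are hypothetical.
    return [
        query,
        "factors affecting monthly billing",
        "reasons an invoice can increase",
    ][:n]

def retrieve_expanded(query, search, k=5):
    # Retrieve for every variant and keep each document's best rank.
    # `search` is a stand-in for your hybrid retrieval function.
    best = {}
    for variant in expand_query(query):
        for rank, doc_id in enumerate(search(variant)):
            best[doc_id] = min(best.get(doc_id, rank), rank)
    return sorted(best, key=best.get)[:k]
```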
&lt;p&gt;&lt;strong&gt;Fine-tuning embeddings on your domain&lt;/strong&gt; is the nuclear option. Only worth it if you have a large domain-specific corpus and retrieval quality is still the bottleneck after all other improvements. For most production systems, a good off-the-shelf model (OpenAI text-embedding-3-large, Cohere embed-v4, Voyage AI&apos;s voyage-3-large) plus the above techniques gets you there.&lt;/p&gt;
&lt;h2&gt;Evaluation is mandatory, not optional&lt;/h2&gt;
&lt;p&gt;The single biggest failure mode of production RAG systems is lack of evaluation. Teams ship the first pipeline that works in demos, then can&apos;t tell whether changes are improvements or regressions.&lt;/p&gt;
&lt;p&gt;You need an eval harness before you ship, not after. The pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Build a test set&lt;/strong&gt; of 50-200 representative queries, each paired with the correct answer or the correct retrieved chunks. This takes a day to build and saves months of guessing later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure retrieval quality&lt;/strong&gt; (did we find the chunks that contain the answer?) separately from &lt;strong&gt;generation quality&lt;/strong&gt; (given the right chunks, did the LLM answer correctly?). These are different failure modes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run the eval on every material change&lt;/strong&gt; — new embedding model, different chunking, added metadata filters, prompt tweaks. Compare scores.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Standard metrics: for retrieval, Recall@K (did the correct chunk appear in the top K retrieved?) and MRR (Mean Reciprocal Rank of the correct chunk). For generation, accuracy against expected answers or LLM-judge scoring (Claude or GPT grading your system&apos;s answers against reference answers).&lt;/p&gt;
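&lt;p&gt;Both retrieval metrics fit in a dozen lines. Given, for each test query, the ranked chunk IDs the pipeline returned and the ID of the chunk that actually contains the answer:&lt;/p&gt;

```python
def recall_at_k(results, relevant, k):
    # Fraction of queries whose correct chunk appears in the top k.
    hits = sum(1 for ranked, gold in zip(results, relevant) if gold in ranked[:k])
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    # Average of 1/rank of the correct chunk, counting 0 for misses.
    total = 0.0
    for ranked, gold in zip(results, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(results)

results  = [["c12", "c07", "c33"], ["c41", "c09", "c18"]]  # per-query rankings
relevant = ["c07", "c18"]  # the chunk holding each query's answer
```

&lt;p&gt;Here Recall@1 is 0.0, Recall@3 is 1.0 and MRR is (1/2 + 1/3) / 2 ≈ 0.42: both correct chunks were found, just never at rank one.&lt;/p&gt;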
&lt;p&gt;LLM-judge is powerful but biased — the judge LLM has preferences. Run it with multiple judges and average. Or use Claude to judge OpenAI&apos;s outputs and vice versa to avoid in-family bias.&lt;/p&gt;
&lt;h2&gt;When RAG is the wrong tool&lt;/h2&gt;
&lt;p&gt;This is the section most teams skip, and the one that matters most.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When the answer requires reasoning over the entire corpus&lt;/strong&gt;, RAG struggles. &quot;Which of our policies has been referenced most frequently in customer complaints?&quot; requires aggregating across many documents, not retrieving a few relevant chunks. A SQL query against structured data beats RAG here.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When the corpus fits in the context window&lt;/strong&gt;, skip RAG. Current Claude models support 200K-token context windows, with longer-context betas beyond that. If your knowledge base is 50K tokens, just put it all in the prompt. No chunking, no retrieval, no vector DB. Cheaper to operate, simpler to debug, often higher quality because the model sees everything.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When the data is real-time&lt;/strong&gt;, RAG with a static vector store is stale. Dashboards, live inventory, current pricing — don&apos;t RAG them. Query the source system directly as a tool call, return results to the LLM.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When the domain requires grounded reasoning against authoritative sources&lt;/strong&gt; (legal, medical), RAG alone is risky. The LLM can still hallucinate even with good retrieval. Add explicit source citations, require the model to quote directly from retrieved chunks, and build human-in-the-loop review for high-stakes answers.&lt;/p&gt;
&lt;h2&gt;What a production-grade RAG system looks like&lt;/h2&gt;
&lt;p&gt;Stripped to the essentials, a RAG system that survives real traffic has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Structural chunking&lt;/strong&gt; that respects document hierarchy, with chunks tagged by section path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid retrieval&lt;/strong&gt; (BM25 + vector search) with metadata pre-filtering&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reranking&lt;/strong&gt; on the top-K retrieved chunks before passing to the LLM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query expansion&lt;/strong&gt; to bridge the user-vs-document phrasing gap&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eval harness&lt;/strong&gt; run on every change, with separate retrieval and generation metrics&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; on the system prompt and any stable retrieved content (which compounds cost savings — see my &lt;a href=&quot;/pensieve/claude-prompt-caching&quot;&gt;previous post on prompt caching&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source attribution&lt;/strong&gt; in the final answer, citing which chunks contributed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback to direct tool calls&lt;/strong&gt; for queries where structured data beats retrieved text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And critically: a rollout pattern where you can compare retrieval strategies in production without touching the UI. A/B test embeddings, chunking strategies, rerankers. The only way to improve RAG is measurement.&lt;/p&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;RAG in production is mostly a retrieval problem, not a generation problem. The prompt matters least. The order of investments that actually move quality:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Structural chunking that respects document hierarchy&lt;/li&gt;
&lt;li&gt;Hybrid search (BM25 + vector) — never pure vector alone&lt;/li&gt;
&lt;li&gt;Reranking with a cross-encoder&lt;/li&gt;
&lt;li&gt;Metadata filtering during retrieval&lt;/li&gt;
&lt;li&gt;Query expansion or HyDE for query-document mismatch&lt;/li&gt;
&lt;li&gt;Eval harness before you ship, not after&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And ask yourself before starting: is RAG actually the right tool, or would &quot;all the docs in the context window&quot; or &quot;a SQL query plus the LLM&quot; get you there cheaper?&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;I&apos;m Ignacio Belando, a freelance senior engineer building Claude and multi-provider LLM integrations for startups and enterprises. If you want help designing or auditing a RAG system, &lt;a href=&quot;mailto:ignaciobelando@gmail.com&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;email me&lt;/a&gt; or see the &lt;a href=&quot;/claude-api-integration-consultant/&quot;&gt;Claude API integration&lt;/a&gt; service page.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Cutting Claude API costs 40-70% with prompt caching (and when it doesn't help)]]></title><description><![CDATA[A practical guide to Anthropic's prompt caching — what actually hits the cache, cost math with real examples, and the cases where caching does nothing or even hurts.]]></description><link>https://ibelando.com/pensieve/claude-prompt-caching</link><guid isPermaLink="false">https://ibelando.com/pensieve/claude-prompt-caching</guid><pubDate>Mon, 20 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Prompt caching is the single biggest cost lever on the Anthropic API. Done right it cuts the bill 40-70% with zero quality change. Done wrong it costs you &lt;strong&gt;more&lt;/strong&gt; than not caching at all — because writing to cache carries a 25% surcharge on that portion of the call.&lt;/p&gt;
&lt;p&gt;This post covers what I&apos;ve learned integrating prompt caching into production Claude pipelines over the last year, including the cases where I&apos;ve explicitly &lt;em&gt;turned it off&lt;/em&gt;. Concrete numbers, the gotchas, and a mental model to decide when to reach for it.&lt;/p&gt;
&lt;h2&gt;What prompt caching actually is&lt;/h2&gt;
&lt;p&gt;Anthropic&apos;s cache lets you pin a chunk of a prompt — typically the system prompt, tool definitions, and large reference content — so subsequent calls can reuse it instead of charging you for input tokens every time.&lt;/p&gt;
&lt;p&gt;The mechanics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cache write&lt;/strong&gt;: costs 1.25× the normal input-token price for the cached portion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache hit&lt;/strong&gt;: costs 0.1× the normal input-token price (a 90% discount)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TTL&lt;/strong&gt;: 5 minutes from the last hit (extends on each hit)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimum&lt;/strong&gt;: 1024 tokens for Sonnet/Opus, 2048 for Haiku&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the math is: caching pays off as soon as the same prefix is used at least twice within 5 minutes. One write plus one hit costs 1.25× + 0.1× = 1.35× the base input price, versus 2.0× for two uncached reads; three uses and you&apos;re solidly ahead. Below that break-even, you&apos;re paying a premium to cache something nobody will reuse.&lt;/p&gt;
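&lt;p&gt;The break-even is easy to sanity-check in code. A sketch for a 15K-token prefix, assuming Sonnet-class input pricing of $3 per million tokens (an assumption; check current rates):&lt;/p&gt;

```python
def prefix_cost(prefix_tokens, calls, price_per_mtok=3.00):
    # Input cost of a shared prefix over N calls in one TTL window.
    # Cache write is billed at 1.25x base, each hit at 0.1x base.
    per_token = price_per_mtok / 1_000_000
    uncached = calls * prefix_tokens * per_token
    cached = prefix_tokens * per_token * (1.25 + 0.1 * (calls - 1))
    return uncached, cached

for calls in (1, 2, 5):
    u, c = prefix_cost(15_000, calls)
    print(f"{calls} calls: uncached ${u:.3f}, cached ${c:.3f}")
```

&lt;p&gt;One call: caching loses (you paid the 1.25× write for nothing). Two calls inside the TTL: caching already wins. Five calls: roughly a third of the uncached cost.&lt;/p&gt;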
&lt;p&gt;That &quot;at least twice within 5 minutes&quot; is the number that actually matters, and it&apos;s the one most teams don&apos;t measure.&lt;/p&gt;
&lt;h2&gt;Where it wins big&lt;/h2&gt;
&lt;p&gt;The clearest wins in my own integrations have been:&lt;/p&gt;
&lt;h3&gt;1. Conversational agents with a fat system prompt&lt;/h3&gt;
&lt;p&gt;Typical shape: a 4K-token system prompt defining tools, personas, safety rails, plus 8K-20K tokens of context documents. Every user message reuses the same prefix.&lt;/p&gt;
&lt;p&gt;Without caching, every turn re-charges the full prefix. With caching, turn 1 pays a 25% surcharge on ~15K tokens. Turns 2-N pay 90% less.&lt;/p&gt;
&lt;p&gt;Real numbers from a support-assistant pipeline I audited: 680 daily conversations averaging 5 turns each. Input cost dropped from ~$145/day to ~$42/day — &lt;strong&gt;71% reduction&lt;/strong&gt; — by caching the system prompt and tool definitions.&lt;/p&gt;
&lt;h3&gt;2. RAG pipelines with a stable knowledge base&lt;/h3&gt;
&lt;p&gt;If your RAG injects the same 10-15K tokens of retrieved context into each call of a multi-turn Q&amp;#x26;A session, caching those retrieved chunks pays off fast.&lt;/p&gt;
&lt;p&gt;Counter-intuitively, caching &lt;em&gt;fails&lt;/em&gt; if you rewrite the retrieved context each time (&quot;here are the 5 most relevant docs for this specific question&quot;). The prefix has to be byte-identical for cache hits. Reshuffling chunks or adding dynamic metadata breaks the cache.&lt;/p&gt;
&lt;p&gt;The fix: retrieve a &lt;em&gt;stable superset&lt;/em&gt; at session start, cache it, and do the question-specific narrowing inside the uncached portion. You pay for slightly more tokens per retrieve but hit cache on every subsequent turn.&lt;/p&gt;
&lt;h3&gt;3. Structured output pipelines with long schemas&lt;/h3&gt;
&lt;p&gt;If you&apos;re forcing structured JSON output with a 2-3K-token schema definition, and you&apos;re calling the model hundreds of times per hour, cache the schema. The savings compound.&lt;/p&gt;
&lt;h3&gt;4. Long document processing with shared instructions&lt;/h3&gt;
&lt;p&gt;Analyzing many documents against the same rubric? Cache the rubric. Each document call only charges for the document tokens (uncached) plus a cheap cache hit for the rubric.&lt;/p&gt;
&lt;h2&gt;Where it does nothing — or actively hurts&lt;/h2&gt;
&lt;p&gt;This is the part most posts on caching skip.&lt;/p&gt;
&lt;h3&gt;Single-shot calls&lt;/h3&gt;
&lt;p&gt;A user sends one question, gets one answer, leaves. No second call within 5 minutes. You paid 1.25× for the cache write, never collected the 0.1× discount. Net cost: +25%.&lt;/p&gt;
&lt;p&gt;Rule of thumb: caching is for &lt;em&gt;sessions&lt;/em&gt;, not for stateless request/response.&lt;/p&gt;
&lt;h3&gt;High variability in the cached region&lt;/h3&gt;
&lt;p&gt;Teams sometimes stuff per-user context into the cache — name, tenant ID, timestamp. That makes every user&apos;s cache unique, which defeats pooling. At low per-user volume the cache never warms.&lt;/p&gt;
&lt;p&gt;Move user-specific content &lt;strong&gt;after&lt;/strong&gt; the cache boundary. Cache the shared system prompt and tool definitions; leave the user context uncached.&lt;/p&gt;
&lt;h3&gt;Cache fragmentation&lt;/h3&gt;
&lt;p&gt;I&apos;ve seen teams accidentally fragment their cache by threading timestamps or session IDs into the system prompt. Every session writes a new cache entry, none get reused. The fix is boring: audit the exact bytes of your cached prefix across calls and eliminate any variable content.&lt;/p&gt;
&lt;p&gt;A quick diagnostic: if &lt;code class=&quot;language-text&quot;&gt;cache_creation_input_tokens&lt;/code&gt; is consistently higher than &lt;code class=&quot;language-text&quot;&gt;cache_read_input_tokens&lt;/code&gt; in your usage reports, your cache isn&apos;t pooling. Either your prefix is too unstable, or your traffic doesn&apos;t justify caching.&lt;/p&gt;
&lt;h3&gt;Very short prefixes&lt;/h3&gt;
&lt;p&gt;Below the minimum (1024 tokens for Sonnet) the cache silently does nothing. The call goes through as uncached, no error, you just don&apos;t see hits. If you configured caching but logs show zero hits, check the token count of the cached portion first.&lt;/p&gt;
&lt;h2&gt;A mental model&lt;/h2&gt;
&lt;p&gt;Before adding &lt;code class=&quot;language-text&quot;&gt;cache_control&lt;/code&gt; anywhere, I ask:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Is this prefix used at least twice within 5 minutes?&lt;/strong&gt; If no, skip caching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is the prefix byte-identical between calls?&lt;/strong&gt; If no, refactor before caching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is it over the minimum token threshold?&lt;/strong&gt; If no, the cache does nothing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is the variable part &lt;em&gt;after&lt;/em&gt; the cached part?&lt;/strong&gt; If not, fix the prompt structure.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Only after yes/yes/yes/yes do I add &lt;code class=&quot;language-text&quot;&gt;cache_control: { type: &quot;ephemeral&quot; }&lt;/code&gt; to the block.&lt;/p&gt;
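&lt;p&gt;Structurally, the placement looks like this in a Messages API request body. The block contents, model choice and variable names are placeholders; the load-bearing part is that &lt;code class=&quot;language-text&quot;&gt;cache_control&lt;/code&gt; sits on the last stable block, with anything user-specific after it:&lt;/p&gt;

```python
SYSTEM_PROMPT = "You are a support assistant for Acme."  # placeholder
REFERENCE_DOCS = "...stable knowledge-base content..."   # placeholder
user_name, tenant_id, question = "Ada", "t-42", "How do I cancel?"

request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        # Stable shared prefix: identical bytes for every user.
        {"type": "text", "text": SYSTEM_PROMPT},
        {
            "type": "text",
            "text": REFERENCE_DOCS,
            # The cache boundary: everything up to and including
            # this block is cached.
            "cache_control": {"type": "ephemeral"},
        },
        # Per-user context lives AFTER the boundary, uncached.
        {"type": "text", "text": f"User: {user_name}, tenant: {tenant_id}"},
    ],
    "messages": [{"role": "user", "content": question}],
}
```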
&lt;h2&gt;Measuring whether it actually works&lt;/h2&gt;
&lt;p&gt;Anthropic returns three fields in every response&apos;s &lt;code class=&quot;language-text&quot;&gt;usage&lt;/code&gt; object that matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;cache_creation_input_tokens&lt;/code&gt; — tokens you wrote to cache (charged at 1.25×)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;cache_read_input_tokens&lt;/code&gt; — tokens read from cache (charged at 0.1×)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;input_tokens&lt;/code&gt; — uncached tokens (charged at 1.0×)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your &lt;em&gt;cache hit rate&lt;/em&gt; is roughly &lt;code class=&quot;language-text&quot;&gt;cache_read / (cache_read + cache_creation)&lt;/code&gt;. A healthy production integration sits above 0.80. Below 0.50, something is off — either your prefix isn&apos;t stable, or traffic is too low for the TTL window.&lt;/p&gt;
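&lt;p&gt;Computing that ratio from logged &lt;code class=&quot;language-text&quot;&gt;usage&lt;/code&gt; objects is a few lines worth keeping in your metrics pipeline. A sketch:&lt;/p&gt;

```python
def cache_hit_rate(usage_records):
    # usage_records: `usage` dicts collected from API responses.
    read = sum(u.get("cache_read_input_tokens", 0) for u in usage_records)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usage_records)
    total = read + written
    return read / total if total else 0.0

# One cold write followed by four warm hits on a 15K-token prefix:
calls = [{"cache_creation_input_tokens": 15000, "cache_read_input_tokens": 0}]
calls += [{"cache_creation_input_tokens": 0, "cache_read_input_tokens": 15000}] * 4
```

&lt;p&gt;This example traffic (one cold write, four warm hits) lands at 0.80, right at the healthy threshold.&lt;/p&gt;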
&lt;p&gt;I log these three fields on every call and chart them daily. When the ratio drops, it&apos;s usually because someone added a variable into the cached region (version strings, timestamps, A/B test flags). The fix is always &quot;move the variable part further down the prompt&quot;.&lt;/p&gt;
&lt;h2&gt;Five-minute TTL is shorter than you think&lt;/h2&gt;
&lt;p&gt;The TTL only extends on &lt;em&gt;hits&lt;/em&gt; to that specific cache entry, not on writes of new entries. For a low-traffic service where users arrive sporadically, the cache keeps expiring between conversations.&lt;/p&gt;
&lt;p&gt;Real example: an internal tool with 40 queries per hour across 8 users meant each user had maybe one call every 12 minutes. TTL expired between every pair of calls. The integration wasn&apos;t hitting cache — it was constantly rewriting it at 1.25×.&lt;/p&gt;
&lt;p&gt;The fix depends on your context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;If traffic is variable&lt;/strong&gt;: Anthropic&apos;s 1-hour cache TTL (at a higher cache-write price) extends the window&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If the prefix is universal across users&lt;/strong&gt;: the cache pools across all of them, so even low per-user volume gets cache hits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If neither applies&lt;/strong&gt;: caching isn&apos;t paying off and you should accept uncached calls&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prompt caching is not a substitute for prompt discipline&lt;/h2&gt;
&lt;p&gt;I keep seeing teams reach for caching to paper over a bloated prompt. You cached 20K tokens of system prompt? Great — but half of that 20K is probably dead instructions the model never needed. Caching the bloat is cheaper than paying full price for the bloat, but both are worse than deleting the bloat.&lt;/p&gt;
&lt;p&gt;Order of operations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shrink the prompt&lt;/strong&gt; — cut unused tools, merge redundant rules, move examples into retrieval&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Structure for cacheability&lt;/strong&gt; — stable prefix first, variables after&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Then add caching&lt;/strong&gt; to the stable prefix&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Inverting this order is how teams end up with a 40K-token system prompt that caches well but produces worse outputs than a 4K-token version would.&lt;/p&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;Prompt caching is a high-leverage optimization when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Same prefix repeats 2+ times within 5 minutes&lt;/li&gt;
&lt;li&gt;Prefix is byte-identical between calls&lt;/li&gt;
&lt;li&gt;Prefix is over 1024 tokens&lt;/li&gt;
&lt;li&gt;Variable content lives &lt;em&gt;after&lt;/em&gt; the cached block&lt;/li&gt;
&lt;li&gt;You&apos;re actually measuring &lt;code class=&quot;language-text&quot;&gt;cache_read_input_tokens&lt;/code&gt; in usage data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&apos;s actively harmful when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Calls are stateless one-shots&lt;/li&gt;
&lt;li&gt;Per-user or per-request variables live in the cached region&lt;/li&gt;
&lt;li&gt;Traffic is too low to hit within TTL&lt;/li&gt;
&lt;li&gt;You&apos;re using it to paper over a bloated prompt that should be shrunk first&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Start by measuring your current cache hit rate before adding more cache blocks. That one number tells you whether the optimization is working.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;I&apos;m Ignacio Belando, a freelance senior engineer building Claude and multi-provider LLM integrations for startups and enterprises. If you want an audit of your current Claude integration or help designing one from scratch, &lt;a href=&quot;mailto:ignaciobelando@gmail.com&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;email me&lt;/a&gt; or see the &lt;a href=&quot;/claude-api-integration-consultant/&quot;&gt;Claude API integration&lt;/a&gt; service page.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Docker Compose Error]]></title><description><![CDATA[docker-compose version discrepancies]]></description><link>https://ibelando.com/pensieve/docker-error</link><guid isPermaLink="false">https://ibelando.com/pensieve/docker-error</guid><pubDate>Fri, 13 Dec 2019 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Problem&lt;/h2&gt;
&lt;p&gt;Recently while updating &lt;a href=&quot;https://github.com/Upstatement/skela-wp-theme&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;Skela&lt;/a&gt; with webpack, I encountered a weird error where I wasn&apos;t able to run a simple script:&lt;/p&gt;
&lt;div class=&quot;gatsby-code-title&quot;&gt;bin/composer&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;token shebang important&quot;&gt;#!/bin/bash&lt;/span&gt;
&lt;span class=&quot;token function&quot;&gt;docker-compose&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-w&lt;/span&gt; /var/www/html/wp-content/themes/skela wordpress &lt;span class=&quot;token function&quot;&gt;composer&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token variable&quot;&gt;$@&lt;/span&gt;&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When trying to run this script via &lt;code class=&quot;language-text&quot;&gt;./bin/composer install&lt;/code&gt;, I got this error in my terminal:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;ERROR: Setting workdir &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;exec&lt;/span&gt; is not supported &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; API &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1.35&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1.30&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The error was coming from the &lt;code class=&quot;language-text&quot;&gt;-w&lt;/code&gt; flag in the &lt;code class=&quot;language-text&quot;&gt;docker-compose exec&lt;/code&gt; command in the &lt;code class=&quot;language-text&quot;&gt;composer&lt;/code&gt; script.&lt;/p&gt;
&lt;h2&gt;Solution&lt;/h2&gt;
&lt;p&gt;Turns out the fix was to update the version in my &lt;code class=&quot;language-text&quot;&gt;docker-compose.yml&lt;/code&gt; file from &lt;code class=&quot;language-text&quot;&gt;3.5&lt;/code&gt; to &lt;code class=&quot;language-text&quot;&gt;3.6&lt;/code&gt;. It&apos;s strange because 3.5 isn&apos;t anywhere close to the API version &lt;code class=&quot;language-text&quot;&gt;1.35&lt;/code&gt; from the error message 🤷‍♀️&lt;/p&gt;
&lt;div class=&quot;gatsby-code-title&quot;&gt;docker-compose.yml&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;gatsby-highlight-code-line&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;3.6&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;services&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;wordpress&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    build&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content:encoded></item></channel></rss>