<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ignacio Belando — Blog]]></title><description><![CDATA[Notes on software engineering, SEO, AI and freelance craft.]]></description><link>https://ibelando.com</link><generator>GatsbyJS</generator><lastBuildDate>Mon, 20 Apr 2026 03:15:59 GMT</lastBuildDate><item><title><![CDATA[RAG in production — what actually breaks, and why retrieval matters more than the prompt]]></title><description><![CDATA[A practical look at retrieval-augmented generation beyond the demo: chunking trade-offs, hybrid search, reranking, eval harnesses, and the cases where RAG is the wrong tool.]]></description><link>https://ibelando.com/pensieve/rag-in-production</link><guid isPermaLink="false">https://ibelando.com/pensieve/rag-in-production</guid><pubDate>Tue, 21 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Every team that ships a RAG system ends up saying some version of &quot;our retrieval is the bottleneck, not the model&quot;. After auditing and building several RAG pipelines in production, I can say that&apos;s basically always true. The prompt is the last 5% of the work. The retrieval is the 95% that decides whether answers are right.&lt;/p&gt;
&lt;p&gt;This post covers what I&apos;ve learned in production RAG work, including the patterns that consistently fail, the ones that survive, and the cases where RAG is actively the wrong tool.&lt;/p&gt;
&lt;h2&gt;What &quot;RAG&quot; actually means in practice&lt;/h2&gt;
&lt;p&gt;Retrieval-Augmented Generation is a pipeline, not a technique. The generic shape:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ingest&lt;/strong&gt;: take documents, chunk them, generate embeddings, store in a vector database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Retrieve&lt;/strong&gt;: on a user query, generate a query embedding, find nearest chunks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Augment&lt;/strong&gt;: concatenate retrieved chunks into the prompt&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Generate&lt;/strong&gt;: call the LLM with the augmented prompt&lt;/li&gt;
&lt;/ol&gt;
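&lt;p&gt;The four steps can be sketched end to end with a toy in-memory index. Everything below is illustrative: &lt;code class=&quot;language-text&quot;&gt;embed&lt;/code&gt; is a stand-in bag-of-words function rather than a real embedding model, and the generate step is stubbed out:&lt;/p&gt;

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in embedding: a bag-of-words vector. A real pipeline
    # would call an embedding model here instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingest: chunk documents and index their embeddings.
docs = [
    "To cancel your subscription, open Billing and choose End membership.",
    "Invoices are issued on the first day of each month.",
]
index = [(chunk, embed(chunk)) for chunk in docs]

# 2. Retrieve: nearest chunks to the query embedding.
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Augment: concatenate retrieved chunks into the prompt.
# 4. Generate: a real system would send this prompt to the LLM.
def build_prompt(query):
    context = "\n\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

&lt;p&gt;Swapping the toy pieces for a real embedding model and vector database changes the plumbing, not the shape.&lt;/p&gt;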
&lt;p&gt;Every step has a dozen decisions, and getting any one of them wrong ruins the output. When your RAG system returns garbage, the LLM is rarely the problem — it&apos;s that you retrieved the wrong chunks, or chunked the source documents in a way that destroyed meaning, or your query embedding didn&apos;t match the document embedding because the user asked differently than the docs were written.&lt;/p&gt;
&lt;h2&gt;The chunking problem&lt;/h2&gt;
&lt;p&gt;Chunking is the first big decision and the one most teams get wrong.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Fixed-size chunking&lt;/strong&gt; (split every N tokens, with overlap) is the default in most tutorials and works for nothing. Slice a legal contract every 512 tokens and you&apos;ll split a sentence across chunks, separate a definition from its use, and return a chunk that starts mid-paragraph with no context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic chunking&lt;/strong&gt; (split on natural boundaries — paragraphs, sections, headings) works much better for documents with structure. For a product manual, split on H2/H3 headings. For legal text, split on sections. For code documentation, split on function or class boundaries. The chunks become self-contained units of meaning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Structural chunking&lt;/strong&gt; (chunks respect the document&apos;s hierarchy) adds metadata. Each chunk carries the path from document root: &quot;Terms of Service &gt; Section 4: Payment Terms &gt; Clause 4.2&quot;. That metadata goes into the prompt with the chunk, giving the LLM context about where the information came from. This single change often improves answer quality more than swapping the embedding model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The hidden cost&lt;/strong&gt;: chunks need to be small enough that several can fit in the context window, but large enough that each chunk is meaningful on its own. I default to 300-800 tokens for prose, 100-300 for code or structured data, and always with at least 50 tokens of overlap to avoid splitting sentences in half. Your mileage may vary.&lt;/p&gt;
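&lt;p&gt;Heading-based structural chunking is only a few lines of code. A minimal sketch for markdown-style documents, assuming H2 headings mark section boundaries (real hierarchies need a recursive version over deeper heading levels):&lt;/p&gt;

```python
def structural_chunks(doc_title, text):
    # Split on H2 headings and tag each chunk with its section path,
    # so each chunk carries its place in the document hierarchy.
    chunks, heading, lines = [], "Introduction", []
    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"path": f"{doc_title} > {heading}", "text": body})
    for line in text.splitlines():
        if line.startswith("## "):
            flush()
            heading = line[3:].strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks
```

&lt;p&gt;Each chunk now carries a path like &quot;Terms of Service &gt; Payment Terms&quot; that goes into the prompt alongside the text.&lt;/p&gt;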
&lt;h2&gt;Retrieval quality is the whole game&lt;/h2&gt;
&lt;p&gt;Once you&apos;ve chunked reasonably, retrieval becomes the bottleneck. This is where teams spend weeks of engineering time and get marginal gains, because they&apos;re optimizing the wrong layer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pure vector search&lt;/strong&gt; (embed the query, find nearest chunks by cosine similarity) works well for semantic matches — &quot;How do I cancel my subscription?&quot; correctly retrieves chunks about subscription cancellation even if the chunks say &quot;end your membership&quot; instead of &quot;cancel&quot;. But vector search fails on exact-match queries: if a user asks about &quot;Product SKU-9834&quot;, a pure semantic search may return chunks about similar-sounding products instead of the exact one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pure keyword search&lt;/strong&gt; (BM25) does the opposite — great at exact matches, bad at semantic similarity.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hybrid search&lt;/strong&gt; (BM25 + vector, then merge with something like reciprocal rank fusion) consistently beats either alone. A simple BM25 + vector hybrid with equal weights already captures most of the gain. If you&apos;re running pure vector search in production, this is the first upgrade to make.&lt;/p&gt;
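&lt;p&gt;Reciprocal rank fusion itself is tiny. A sketch that merges ranked lists of chunk IDs from two retrievers, using the conventional constant k=60:&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: one ranked list of chunk IDs per retriever
    # (e.g. one from BM25, one from vector search).
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["sku-9834", "warranty", "returns"]
vector_hits = ["returns", "sku-9834", "shipping"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

&lt;p&gt;&lt;code class=&quot;language-text&quot;&gt;sku-9834&lt;/code&gt; ends up first because both retrievers rank it highly; a chunk only one retriever found sinks toward the bottom.&lt;/p&gt;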
&lt;p&gt;&lt;strong&gt;Reranking&lt;/strong&gt; (take top-K results from hybrid, then rerank with a cross-encoder like Cohere&apos;s Rerank or Jina&apos;s reranker) is the second-biggest improvement after hybrid search. Cross-encoders are slower than bi-encoder retrieval but much more accurate — they look at the query and chunk together, not independently. Rerank the top-20 to top-5 and the quality of the final 5 is dramatically higher.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Metadata filters&lt;/strong&gt; are the underused lever. If the user query mentions a product, filter chunks to that product&apos;s documentation before semantic search. If the query mentions a date range, filter to chunks from that range. Most RAG systems skip this because it requires upfront metadata tagging during ingestion — but the payoff is large and compounds with every other retrieval improvement.&lt;/p&gt;
&lt;h2&gt;The query-document mismatch problem&lt;/h2&gt;
&lt;p&gt;Users don&apos;t ask questions the way documents are written. They say &quot;Why is my invoice so high?&quot; — the relevant documentation says &quot;Factors affecting monthly billing&quot;. Your embedding model needs to connect those two, which is the whole point of semantic search, but it doesn&apos;t always work.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Query expansion&lt;/strong&gt; — take the user query and rewrite it into several alternatives before retrieval — helps close this gap. Use the LLM itself: give it the user query and ask for three alternative phrasings of the same question. Retrieve against all of them, merge the results.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hypothetical document embedding&lt;/strong&gt; (HyDE) is the inverse — ask the LLM to generate a hypothetical &lt;em&gt;answer&lt;/em&gt; to the query, embed that, and search for real documents similar to the hypothetical answer. Counterintuitive but it works for technical queries where the answer style matches your documentation style better than the question style does.&lt;/p&gt;
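&lt;p&gt;Both techniques share one skeleton: generate alternative texts, retrieve for each, merge. A sketch with a stubbed rewrite step; the variant strings are hypothetical placeholders, and in a real system the LLM would produce them:&lt;/p&gt;

```python
def expand_query(query, n=3):
    # Stub: a real system would ask the LLM for n alternative
    # phrasings of the question. These variants are hypothetical.
    return [
        query,
        "factors affecting monthly billing",
        "reasons an invoice can increase",
    ][:n]

def retrieve_expanded(query, search, k=5):
    # Retrieve for every variant and keep each document's best rank.
    # `search` is a stand-in for your hybrid retrieval function.
    best = {}
    for variant in expand_query(query):
        for rank, doc_id in enumerate(search(variant)):
            best[doc_id] = min(best.get(doc_id, rank), rank)
    return sorted(best, key=best.get)[:k]
```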
&lt;p&gt;&lt;strong&gt;Fine-tuning embeddings on your domain&lt;/strong&gt; is the nuclear option. Only worth it if you have a large domain-specific corpus and retrieval quality is still the bottleneck after all other improvements. For most production systems, a good off-the-shelf model (OpenAI text-embedding-3-large, Cohere embed-v4, Voyage AI&apos;s voyage-3-large) plus the above techniques gets you there.&lt;/p&gt;
&lt;h2&gt;Evaluation is mandatory, not optional&lt;/h2&gt;
&lt;p&gt;The single biggest failure mode of production RAG systems is lack of evaluation. Teams ship the first pipeline that works in demos, then can&apos;t tell whether changes are improvements or regressions.&lt;/p&gt;
&lt;p&gt;You need an eval harness before you ship, not after. The pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Build a test set&lt;/strong&gt; of 50-200 representative queries, each paired with the correct answer or the correct retrieved chunks. This takes a day to build and saves months of guessing later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Measure retrieval quality&lt;/strong&gt; (did we find the chunks that contain the answer?) separately from &lt;strong&gt;generation quality&lt;/strong&gt; (given the right chunks, did the LLM answer correctly?). These are different failure modes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Run the eval on every material change&lt;/strong&gt; — new embedding model, different chunking, added metadata filters, prompt tweaks. Compare scores.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Standard metrics: for retrieval, Recall@K (did the correct chunk appear in the top K retrieved?) and MRR (Mean Reciprocal Rank of the correct chunk). For generation, accuracy against expected answers or LLM-judge scoring (Claude or GPT grading your system&apos;s answers against reference answers).&lt;/p&gt;
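&lt;p&gt;Both retrieval metrics fit in a dozen lines. Given, for each test query, the ranked chunk IDs the pipeline returned and the ID of the chunk that actually contains the answer:&lt;/p&gt;

```python
def recall_at_k(results, relevant, k):
    # Fraction of queries whose correct chunk appears in the top k.
    hits = sum(1 for ranked, gold in zip(results, relevant) if gold in ranked[:k])
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    # Average of 1/rank of the correct chunk, counting 0 for misses.
    total = 0.0
    for ranked, gold in zip(results, relevant):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(results)

results  = [["c12", "c07", "c33"], ["c41", "c09", "c18"]]  # per-query rankings
relevant = ["c07", "c18"]  # the chunk holding each query's answer
```

&lt;p&gt;Here Recall@1 is 0.0, Recall@3 is 1.0 and MRR is (1/2 + 1/3) / 2 ≈ 0.42: both correct chunks were found, just never at rank one.&lt;/p&gt;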
&lt;p&gt;LLM-judge is powerful but biased — the judge LLM has preferences. Run it with multiple judges and average. Or use Claude to judge OpenAI&apos;s outputs and vice versa to avoid in-family bias.&lt;/p&gt;
&lt;h2&gt;When RAG is the wrong tool&lt;/h2&gt;
&lt;p&gt;This is the section most teams skip, and the one that matters most.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When the answer requires reasoning over the entire corpus&lt;/strong&gt;, RAG struggles. &quot;Which of our policies has been referenced most frequently in customer complaints?&quot; requires aggregating across many documents, not retrieving a few relevant chunks. A SQL query against structured data beats RAG here.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When the corpus fits in the context window&lt;/strong&gt;, skip RAG. Current Claude models support 200K-token context windows, with longer-context betas beyond that. If your knowledge base is 50K tokens, just put it all in the prompt. No chunking, no retrieval, no vector DB. Cheaper to operate, simpler to debug, often higher quality because the model sees everything.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When the data is real-time&lt;/strong&gt;, RAG with a static vector store is stale. Dashboards, live inventory, current pricing — don&apos;t RAG them. Query the source system directly as a tool call, return results to the LLM.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When the domain requires grounded reasoning against authoritative sources&lt;/strong&gt; (legal, medical), RAG alone is risky. The LLM can still hallucinate even with good retrieval. Add explicit source citations, require the model to quote directly from retrieved chunks, and build human-in-the-loop review for high-stakes answers.&lt;/p&gt;
&lt;h2&gt;What a production-grade RAG system looks like&lt;/h2&gt;
&lt;p&gt;Stripped to the essentials, a RAG system that survives real traffic has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Structural chunking&lt;/strong&gt; that respects document hierarchy, with chunks tagged by section path&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hybrid retrieval&lt;/strong&gt; (BM25 + vector search) with metadata pre-filtering&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reranking&lt;/strong&gt; on the top-K retrieved chunks before passing to the LLM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Query expansion&lt;/strong&gt; to bridge the user-vs-document phrasing gap&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Eval harness&lt;/strong&gt; run on every change, with separate retrieval and generation metrics&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Prompt caching&lt;/strong&gt; on the system prompt and any stable retrieved content (which compounds cost savings — see my &lt;a href=&quot;/pensieve/claude-prompt-caching&quot;&gt;previous post on prompt caching&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Source attribution&lt;/strong&gt; in the final answer, citing which chunks contributed&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback to direct tool calls&lt;/strong&gt; for queries where structured data beats retrieved text&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And critically: a rollout pattern where you can compare retrieval strategies in production without touching the UI. A/B test embeddings, chunking strategies, rerankers. The only way to improve RAG is measurement.&lt;/p&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;RAG in production is mostly a retrieval problem, not a generation problem. The prompt matters least. The order of investments that actually move quality:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Structural chunking that respects document hierarchy&lt;/li&gt;
&lt;li&gt;Hybrid search (BM25 + vector) — never pure vector alone&lt;/li&gt;
&lt;li&gt;Reranking with a cross-encoder&lt;/li&gt;
&lt;li&gt;Metadata filtering during retrieval&lt;/li&gt;
&lt;li&gt;Query expansion or HyDE for query-document mismatch&lt;/li&gt;
&lt;li&gt;Eval harness before you ship, not after&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;And ask yourself before starting: is RAG actually the right tool, or would &quot;all the docs in the context window&quot; or &quot;a SQL query plus the LLM&quot; get you there cheaper?&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;I&apos;m Ignacio Belando, a freelance senior engineer building Claude and multi-provider LLM integrations for startups and enterprises. If you want help designing or auditing a RAG system, &lt;a href=&quot;mailto:ignaciobelando@gmail.com&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;email me&lt;/a&gt; or see the &lt;a href=&quot;/claude-api-integration-consultant/&quot;&gt;Claude API integration&lt;/a&gt; service page.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Cutting Claude API costs 40-70% with prompt caching (and when it doesn't help)]]></title><description><![CDATA[A practical guide to Anthropic's prompt caching — what actually hits the cache, cost math with real examples, and the cases where caching does nothing or even hurts.]]></description><link>https://ibelando.com/pensieve/claude-prompt-caching</link><guid isPermaLink="false">https://ibelando.com/pensieve/claude-prompt-caching</guid><pubDate>Mon, 20 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;Prompt caching is the single biggest cost lever on the Anthropic API. Done right it cuts the bill 40-70% with zero quality change. Done wrong it costs you &lt;strong&gt;more&lt;/strong&gt; than not caching at all — because writing to cache carries a 25% surcharge on that portion of the call.&lt;/p&gt;
&lt;p&gt;This post covers what I&apos;ve learned integrating prompt caching into production Claude pipelines over the last year, including the cases where I&apos;ve explicitly &lt;em&gt;turned it off&lt;/em&gt;. Concrete numbers, the gotchas, and a mental model to decide when to reach for it.&lt;/p&gt;
&lt;h2&gt;What prompt caching actually is&lt;/h2&gt;
&lt;p&gt;Anthropic&apos;s cache lets you pin a chunk of a prompt — typically the system prompt, tool definitions, and large reference content — so subsequent calls can reuse it instead of charging you for input tokens every time.&lt;/p&gt;
&lt;p&gt;The mechanics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cache write&lt;/strong&gt;: costs 1.25× the normal input-token price for the cached portion&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache hit&lt;/strong&gt;: costs 0.1× the normal input-token price (a 90% discount)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TTL&lt;/strong&gt;: 5 minutes from the last hit (extends on each hit)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimum&lt;/strong&gt;: 1024 tokens for Sonnet/Opus, 2048 for Haiku&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the math is: caching pays off as soon as the same prefix is used at least twice within 5 minutes. One write plus one hit costs 1.25× + 0.1× = 1.35× the base input price, versus 2.0× for two uncached reads; three uses and you&apos;re solidly ahead. Below that break-even, you&apos;re paying a premium to cache something nobody will reuse.&lt;/p&gt;
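&lt;p&gt;The break-even is easy to sanity-check in code. A sketch for a 15K-token prefix, assuming Sonnet-class input pricing of $3 per million tokens (an assumption; check current rates):&lt;/p&gt;

```python
def prefix_cost(prefix_tokens, calls, price_per_mtok=3.00):
    # Input cost of a shared prefix over N calls in one TTL window.
    # Cache write is billed at 1.25x base, each hit at 0.1x base.
    per_token = price_per_mtok / 1_000_000
    uncached = calls * prefix_tokens * per_token
    cached = prefix_tokens * per_token * (1.25 + 0.1 * (calls - 1))
    return uncached, cached

for calls in (1, 2, 5):
    u, c = prefix_cost(15_000, calls)
    print(f"{calls} calls: uncached ${u:.3f}, cached ${c:.3f}")
```

&lt;p&gt;One call: caching loses (you paid the 1.25× write for nothing). Two calls inside the TTL: caching already wins. Five calls: roughly a third of the uncached cost.&lt;/p&gt;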
&lt;p&gt;That &quot;at least twice within 5 minutes&quot; is the number that actually matters, and it&apos;s the one most teams don&apos;t measure.&lt;/p&gt;
&lt;h2&gt;Where it wins big&lt;/h2&gt;
&lt;p&gt;The clearest wins in my own integrations have been:&lt;/p&gt;
&lt;h3&gt;1. Conversational agents with a fat system prompt&lt;/h3&gt;
&lt;p&gt;Typical shape: a 4K-token system prompt defining tools, personas, safety rails, plus 8K-20K tokens of context documents. Every user message reuses the same prefix.&lt;/p&gt;
&lt;p&gt;Without caching, every turn re-charges the full prefix. With caching, turn 1 pays a 25% surcharge on ~15K tokens. Turns 2-N pay 90% less.&lt;/p&gt;
&lt;p&gt;Real numbers from a support-assistant pipeline I audited: 680 daily conversations averaging 5 turns each. Input cost dropped from ~$145/day to ~$42/day — &lt;strong&gt;71% reduction&lt;/strong&gt; — by caching the system prompt and tool definitions.&lt;/p&gt;
&lt;h3&gt;2. RAG pipelines with a stable knowledge base&lt;/h3&gt;
&lt;p&gt;If your RAG injects the same 10-15K tokens of retrieved context into each call of a multi-turn Q&amp;#x26;A session, caching those retrieved chunks pays off fast.&lt;/p&gt;
&lt;p&gt;Counter-intuitively, caching &lt;em&gt;fails&lt;/em&gt; if you rewrite the retrieved context each time (&quot;here are the 5 most relevant docs for this specific question&quot;). The prefix has to be byte-identical for cache hits. Reshuffling chunks or adding dynamic metadata breaks the cache.&lt;/p&gt;
&lt;p&gt;The fix: retrieve a &lt;em&gt;stable superset&lt;/em&gt; at session start, cache it, and do the question-specific narrowing inside the uncached portion. You pay for slightly more tokens per retrieve but hit cache on every subsequent turn.&lt;/p&gt;
&lt;h3&gt;3. Structured output pipelines with long schemas&lt;/h3&gt;
&lt;p&gt;If you&apos;re forcing structured JSON output with a 2-3K-token schema definition, and you&apos;re calling the model hundreds of times per hour, cache the schema. The savings compound.&lt;/p&gt;
&lt;h3&gt;4. Long document processing with shared instructions&lt;/h3&gt;
&lt;p&gt;Analyzing many documents against the same rubric? Cache the rubric. Each document call only charges for the document tokens (uncached) plus a cheap cache hit for the rubric.&lt;/p&gt;
&lt;h2&gt;Where it does nothing — or actively hurts&lt;/h2&gt;
&lt;p&gt;This is the part most posts on caching skip.&lt;/p&gt;
&lt;h3&gt;Single-shot calls&lt;/h3&gt;
&lt;p&gt;A user sends one question, gets one answer, leaves. No second call within 5 minutes. You paid 1.25× for the cache write, never collected the 0.1× discount. Net cost: +25%.&lt;/p&gt;
&lt;p&gt;Rule of thumb: caching is for &lt;em&gt;sessions&lt;/em&gt;, not for stateless request/response.&lt;/p&gt;
&lt;h3&gt;High variability in the cached region&lt;/h3&gt;
&lt;p&gt;Teams sometimes stuff per-user context into the cache — name, tenant ID, timestamp. That makes every user&apos;s cache unique, which defeats pooling. At low per-user volume the cache never warms.&lt;/p&gt;
&lt;p&gt;Move user-specific content &lt;strong&gt;after&lt;/strong&gt; the cache boundary. Cache the shared system prompt and tool definitions; leave the user context uncached.&lt;/p&gt;
&lt;h3&gt;Cache fragmentation&lt;/h3&gt;
&lt;p&gt;I&apos;ve seen teams accidentally fragment their cache by threading timestamps or session IDs into the system prompt. Every session writes a new cache entry, none get reused. The fix is boring: audit the exact bytes of your cached prefix across calls and eliminate any variable content.&lt;/p&gt;
&lt;p&gt;A quick diagnostic: if &lt;code class=&quot;language-text&quot;&gt;cache_creation_input_tokens&lt;/code&gt; is consistently higher than &lt;code class=&quot;language-text&quot;&gt;cache_read_input_tokens&lt;/code&gt; in your usage reports, your cache isn&apos;t pooling. Either your prefix is too unstable, or your traffic doesn&apos;t justify caching.&lt;/p&gt;
&lt;h3&gt;Very short prefixes&lt;/h3&gt;
&lt;p&gt;Below the minimum (1024 tokens for Sonnet) the cache silently does nothing. The call goes through as uncached, no error, you just don&apos;t see hits. If you configured caching but logs show zero hits, check the token count of the cached portion first.&lt;/p&gt;
&lt;h2&gt;A mental model&lt;/h2&gt;
&lt;p&gt;Before adding &lt;code class=&quot;language-text&quot;&gt;cache_control&lt;/code&gt; anywhere, I ask:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Is this prefix used at least twice within 5 minutes?&lt;/strong&gt; If no, skip caching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is the prefix byte-identical between calls?&lt;/strong&gt; If no, refactor before caching.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is it over the minimum token threshold?&lt;/strong&gt; If no, the cache does nothing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Is the variable part &lt;em&gt;after&lt;/em&gt; the cached part?&lt;/strong&gt; If not, fix the prompt structure.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Only after yes/yes/yes/yes do I add &lt;code class=&quot;language-text&quot;&gt;cache_control: { type: &quot;ephemeral&quot; }&lt;/code&gt; to the block.&lt;/p&gt;
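&lt;p&gt;Structurally, the placement looks like this in a Messages API request body. The block contents, model choice and variable names are placeholders; the load-bearing part is that &lt;code class=&quot;language-text&quot;&gt;cache_control&lt;/code&gt; sits on the last stable block, with anything user-specific after it:&lt;/p&gt;

```python
SYSTEM_PROMPT = "You are a support assistant for Acme."  # placeholder
REFERENCE_DOCS = "...stable knowledge-base content..."   # placeholder
user_name, tenant_id, question = "Ada", "t-42", "How do I cancel?"

request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        # Stable shared prefix: identical bytes for every user.
        {"type": "text", "text": SYSTEM_PROMPT},
        {
            "type": "text",
            "text": REFERENCE_DOCS,
            # The cache boundary: everything up to and including
            # this block is cached.
            "cache_control": {"type": "ephemeral"},
        },
        # Per-user context lives AFTER the boundary, uncached.
        {"type": "text", "text": f"User: {user_name}, tenant: {tenant_id}"},
    ],
    "messages": [{"role": "user", "content": question}],
}
```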
&lt;h2&gt;Measuring whether it actually works&lt;/h2&gt;
&lt;p&gt;Anthropic returns three fields in every response&apos;s &lt;code class=&quot;language-text&quot;&gt;usage&lt;/code&gt; object that matter:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;cache_creation_input_tokens&lt;/code&gt; — tokens you wrote to cache (charged at 1.25×)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;cache_read_input_tokens&lt;/code&gt; — tokens read from cache (charged at 0.1×)&lt;/li&gt;
&lt;li&gt;&lt;code class=&quot;language-text&quot;&gt;input_tokens&lt;/code&gt; — uncached tokens (charged at 1.0×)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Your &lt;em&gt;cache hit rate&lt;/em&gt; is roughly &lt;code class=&quot;language-text&quot;&gt;cache_read / (cache_read + cache_creation)&lt;/code&gt;. A healthy production integration sits above 0.80. Below 0.50, something is off — either your prefix isn&apos;t stable, or traffic is too low for the TTL window.&lt;/p&gt;
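&lt;p&gt;Computing that ratio from logged &lt;code class=&quot;language-text&quot;&gt;usage&lt;/code&gt; objects is a few lines worth keeping in your metrics pipeline. A sketch:&lt;/p&gt;

```python
def cache_hit_rate(usage_records):
    # usage_records: `usage` dicts collected from API responses.
    read = sum(u.get("cache_read_input_tokens", 0) for u in usage_records)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usage_records)
    total = read + written
    return read / total if total else 0.0

# One cold write followed by four warm hits on a 15K-token prefix:
calls = [{"cache_creation_input_tokens": 15000, "cache_read_input_tokens": 0}]
calls += [{"cache_creation_input_tokens": 0, "cache_read_input_tokens": 15000}] * 4
```

&lt;p&gt;This example traffic (one cold write, four warm hits) lands at 0.80, right at the healthy threshold.&lt;/p&gt;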
&lt;p&gt;I log these three fields on every call and chart them daily. When the ratio drops, it&apos;s usually because someone added a variable into the cached region (version strings, timestamps, A/B test flags). The fix is always &quot;move the variable part further down the prompt&quot;.&lt;/p&gt;
&lt;h2&gt;Five-minute TTL is shorter than you think&lt;/h2&gt;
&lt;p&gt;The TTL only extends on &lt;em&gt;hits&lt;/em&gt; to that specific cache entry, not on writes of new entries. For a low-traffic service where users arrive sporadically, the cache keeps expiring between conversations.&lt;/p&gt;
&lt;p&gt;Real example: an internal tool with 40 queries per hour across 8 users meant each user had maybe one call every 12 minutes. TTL expired between every pair of calls. The integration wasn&apos;t hitting cache — it was constantly rewriting it at 1.25×.&lt;/p&gt;
&lt;p&gt;The fix depends on your context:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;If traffic is variable&lt;/strong&gt;: Anthropic&apos;s 1-hour cache TTL (at a higher cache-write price) extends the window&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If the prefix is universal across users&lt;/strong&gt;: the cache pools across all of them, so even low per-user volume gets cache hits&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If neither applies&lt;/strong&gt;: caching isn&apos;t paying off and you should accept uncached calls&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Prompt caching is not a substitute for prompt discipline&lt;/h2&gt;
&lt;p&gt;I keep seeing teams reach for caching to paper over a bloated prompt. You cached 20K tokens of system prompt? Great — but half of that 20K is probably dead instructions the model never needed. Caching the bloat is cheaper than paying full price for the bloat, but both are worse than deleting the bloat.&lt;/p&gt;
&lt;p&gt;Order of operations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Shrink the prompt&lt;/strong&gt; — cut unused tools, merge redundant rules, move examples into retrieval&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Structure for cacheability&lt;/strong&gt; — stable prefix first, variables after&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Then add caching&lt;/strong&gt; to the stable prefix&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Inverting this order is how teams end up with a 40K-token system prompt that caches well but produces worse outputs than a 4K-token version would.&lt;/p&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;Prompt caching is a high-leverage optimization when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Same prefix repeats 2+ times within 5 minutes&lt;/li&gt;
&lt;li&gt;Prefix is byte-identical between calls&lt;/li&gt;
&lt;li&gt;Prefix is over 1024 tokens&lt;/li&gt;
&lt;li&gt;Variable content lives &lt;em&gt;after&lt;/em&gt; the cached block&lt;/li&gt;
&lt;li&gt;You&apos;re actually measuring &lt;code class=&quot;language-text&quot;&gt;cache_read_input_tokens&lt;/code&gt; in usage data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It&apos;s actively harmful when:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Calls are stateless one-shots&lt;/li&gt;
&lt;li&gt;Per-user or per-request variables live in the cached region&lt;/li&gt;
&lt;li&gt;Traffic is too low to hit within TTL&lt;/li&gt;
&lt;li&gt;You&apos;re using it to paper over a bloated prompt that should be shrunk first&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Start by measuring your current cache hit rate before adding more cache blocks. That one number tells you whether the optimization is working.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;em&gt;I&apos;m Ignacio Belando, a freelance senior engineer building Claude and multi-provider LLM integrations for startups and enterprises. If you want an audit of your current Claude integration or help designing one from scratch, &lt;a href=&quot;mailto:ignaciobelando@gmail.com&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;email me&lt;/a&gt; or see the &lt;a href=&quot;/claude-api-integration-consultant/&quot;&gt;Claude API integration&lt;/a&gt; service page.&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title><![CDATA[Docker Compose Error]]></title><description><![CDATA[docker-compose version discrepancies]]></description><link>https://ibelando.com/pensieve/docker-error</link><guid isPermaLink="false">https://ibelando.com/pensieve/docker-error</guid><pubDate>Fri, 13 Dec 2019 00:00:00 GMT</pubDate><content:encoded>&lt;h2&gt;Problem&lt;/h2&gt;
&lt;p&gt;Recently while updating &lt;a href=&quot;https://github.com/Upstatement/skela-wp-theme&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener noreferrer&quot;&gt;Skela&lt;/a&gt; with webpack, I encountered a weird error where I wasn&apos;t able to run a simple script:&lt;/p&gt;
&lt;div class=&quot;gatsby-code-title&quot;&gt;bin/composer&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;&lt;span class=&quot;token shebang important&quot;&gt;#!/bin/bash&lt;/span&gt;
&lt;span class=&quot;token function&quot;&gt;docker-compose&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;token parameter variable&quot;&gt;-w&lt;/span&gt; /var/www/html/wp-content/themes/skela wordpress &lt;span class=&quot;token function&quot;&gt;composer&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&quot;&lt;span class=&quot;token variable&quot;&gt;$@&lt;/span&gt;&quot;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;When trying to run this script via &lt;code class=&quot;language-text&quot;&gt;./bin/composer install&lt;/code&gt;, I got this error in my terminal:&lt;/p&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;shell&quot;&gt;&lt;pre class=&quot;language-shell&quot;&gt;&lt;code class=&quot;language-shell&quot;&gt;ERROR: Setting workdir &lt;span class=&quot;token keyword&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;token builtin class-name&quot;&gt;exec&lt;/span&gt; is not supported &lt;span class=&quot;token keyword&quot;&gt;in&lt;/span&gt; API &lt;span class=&quot;token operator&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;token number&quot;&gt;1.35&lt;/span&gt; &lt;span class=&quot;token punctuation&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;token number&quot;&gt;1.30&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;The error was coming from the &lt;code class=&quot;language-text&quot;&gt;-w&lt;/code&gt; flag in the &lt;code class=&quot;language-text&quot;&gt;docker-compose exec&lt;/code&gt; command in the &lt;code class=&quot;language-text&quot;&gt;composer&lt;/code&gt; script.&lt;/p&gt;
&lt;h2&gt;Solution&lt;/h2&gt;
&lt;p&gt;Turns out the fix was to update the version in my &lt;code class=&quot;language-text&quot;&gt;docker-compose.yml&lt;/code&gt; file from &lt;code class=&quot;language-text&quot;&gt;3.5&lt;/code&gt; to &lt;code class=&quot;language-text&quot;&gt;3.6&lt;/code&gt;. It&apos;s strange because 3.5 isn&apos;t anywhere close to the API version &lt;code class=&quot;language-text&quot;&gt;1.35&lt;/code&gt; from the error message 🤷‍♀️&lt;/p&gt;
&lt;div class=&quot;gatsby-code-title&quot;&gt;docker-compose.yml&lt;/div&gt;
&lt;div class=&quot;gatsby-highlight&quot; data-language=&quot;yaml&quot;&gt;&lt;pre class=&quot;language-yaml&quot;&gt;&lt;code class=&quot;language-yaml&quot;&gt;&lt;span class=&quot;gatsby-highlight-code-line&quot;&gt;&lt;span class=&quot;token key atrule&quot;&gt;version&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;token string&quot;&gt;&apos;3.6&apos;&lt;/span&gt;&lt;/span&gt;
&lt;span class=&quot;token key atrule&quot;&gt;services&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;token key atrule&quot;&gt;wordpress&lt;/span&gt;&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;
    build&lt;span class=&quot;token punctuation&quot;&gt;:&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</content:encoded></item></channel></rss>