Cutting Claude API costs 40-70% with prompt caching (and when it doesn't help)
· 8 min read — #Claude #Anthropic #LLM #Prompt Caching #Cost Optimization
Prompt caching is the single biggest cost lever on the Anthropic API. Done right it cuts the bill 40-70% with zero quality change. Done wrong it costs you more than not caching at all — because writing to cache carries a 25% surcharge on that portion of the call.
This post covers what I've learned integrating prompt caching into production Claude pipelines over the last year, including the cases where I've explicitly turned it off. Concrete numbers, the gotchas, and a mental model to decide when to reach for it.
What prompt caching actually is
Anthropic's cache lets you pin a chunk of a prompt — typically the system prompt, tool definitions, and large reference content — so subsequent calls can reuse it instead of charging you for input tokens every time.
The mechanics:
- Cache write: costs 1.25× the normal input-token price for the cached portion
- Cache hit: costs 0.1× the normal input-token price (a 90% discount)
- TTL: 5 minutes from the last hit (extends on each hit)
- Minimum cacheable prefix: 1024 tokens for Sonnet/Opus, 2048 for Haiku
So the math: one cache write (1.25×) plus one hit (0.1×) costs 1.35× the base input price, versus 2× for two uncached calls. The cache pays off once the same prefix appears at least twice within 5 minutes; three uses and you're solidly ahead. Below that break-even, you're paying a 25% premium to cache something nobody will reuse.
That "at least twice within 5 minutes" is the number that actually matters, and it's the one most teams don't measure.
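To make the break-even concrete, here's the arithmetic as a tiny sketch, with the prefix cost expressed as a multiple of the base input-token price:

```python
# Break-even arithmetic for the multipliers above: cost of a prefix used
# n times, in units of (prefix tokens x base input price).

def relative_cost(n_uses: int, cached: bool) -> float:
    if not cached:
        return 1.0 * n_uses                 # full price on every call
    return 1.25 + 0.1 * (n_uses - 1)        # one write, then cache hits

# 1 use:  cached 1.25 vs uncached 1.0  -> caching loses
# 2 uses: cached 1.35 vs uncached 2.0  -> caching already wins
# 5 uses: cached 1.65 vs uncached 5.0  -> roughly a two-thirds saving
for n in (1, 2, 5):
    print(n, relative_cost(n, cached=True), relative_cost(n, cached=False))
```

The slope is the whole story: each extra use within the TTL costs 0.1× cached versus 1.0× uncached, so savings grow with every turn of the session.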
Where it wins big
The clearest wins in my own integrations have been:
1. Conversational agents with a fat system prompt
Typical shape: a 4K-token system prompt defining tools, personas, safety rails, plus 8K-20K tokens of context documents. Every user message reuses the same prefix.
Without caching, every turn re-charges the full prefix. With caching, turn 1 pays a 25% surcharge on ~15K tokens. Turns 2-N pay 90% less.
Real numbers from a support-assistant pipeline I audited: 680 daily conversations averaging 5 turns each. Input cost dropped from ~$145/day to ~$42/day — 71% reduction — by caching the system prompt and tool definitions.
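For reference, this is roughly the request shape for case 1, sketched against the Messages API: cache_control on the system block marks the cache boundary, and the per-turn user message stays outside it. The model id, system prompt, and message below are placeholders, not values from the pipeline above.

```python
# Conversational-agent request shape: the system block carries cache_control,
# so every turn after the first reads the prefix from cache.

SYSTEM_PROMPT = "…several thousand tokens of persona, rules, and tool guidance…"

request = {
    "model": "claude-sonnet-4-5",          # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Everything up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        # Variable per-turn content lives after the cache boundary.
        {"role": "user", "content": "Where is my order?"}
    ],
}

# With the official SDK this would be sent as:
#   client = anthropic.Anthropic()
#   response = client.messages.create(**request)
```

Tool definitions can be cached the same way by putting cache_control on the last entry in the tools array; per Anthropic's docs, the cached prefix runs tools, then system, then messages, up to the breakpoint.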
2. RAG pipelines with a stable knowledge base
If your RAG injects the same 10-15K tokens of retrieved context into each call of a multi-turn Q&A session, caching those retrieved chunks pays off fast.
The common failure mode: caching breaks if you rewrite the retrieved context on every call ("here are the 5 most relevant docs for this specific question"). Cache hits require a byte-identical prefix, so reshuffling chunks or injecting dynamic metadata invalidates the cache.
The fix: retrieve a stable superset at session start, cache it, and do the question-specific narrowing inside the uncached portion. You pay for slightly more tokens per retrieve but hit cache on every subsequent turn.
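A sketch of that fix, under the same assumptions (retrieve_superset, the model id, and the prompt wording are hypothetical stand-ins): the superset lives in the cached block and stays byte-identical across turns, while the question rides in the uncached user message.

```python
# The "stable superset" pattern: retrieve once per session, cache the
# superset, and do question-specific narrowing in the uncached turn.

def retrieve_superset(topic: str) -> str:
    # Placeholder for a wide retrieval done once at session start.
    return f"[10-15K tokens of reference docs about {topic}]"

def build_request(superset: str, question: str) -> dict:
    return {
        "model": "claude-sonnet-4-5",      # placeholder model id
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": "Answer from these documents:\n" + superset,
            # Byte-identical across turns: cache hit on every turn but the first.
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [
            # Question-specific content stays outside the cached prefix.
            {"role": "user", "content": question}
        ],
    }

superset = retrieve_superset("billing")
r1 = build_request(superset, "How do refunds work?")
r2 = build_request(superset, "What about partial refunds?")
assert r1["system"][0]["text"] == r2["system"][0]["text"]   # stable cached block
```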
3. Structured output pipelines with long schemas
If you're forcing structured JSON output with a 2-3K-token schema definition, and you're calling the model hundreds of times per hour, cache the schema. The savings compound.
4. Long document processing with shared instructions
Analyzing many documents against the same rubric? Cache the rubric. Each document call only charges for the document tokens (uncached) plus a cheap cache hit for the rubric.
Where it does nothing — or actively hurts
This is the part most posts on caching skip.
Single-shot calls
A user sends one question, gets one answer, leaves. No second call within 5 minutes. You paid 1.25× for the cache write, never collected the 0.1× discount. Net cost: +25%.
Rule of thumb: caching is for sessions, not for stateless request/response.
High variability in the cached region
Teams sometimes stuff per-user context into the cache — name, tenant ID, timestamp. That makes every user's cache unique, which defeats pooling. At low per-user volume the cache never warms.
Move user-specific content after the cache boundary. Cache the shared system prompt and tool definitions; leave the user context uncached.
Cache fragmentation
I've seen teams accidentally fragment their cache by threading timestamps or session IDs into the system prompt. Every session writes a new cache entry, none get reused. The fix is boring: audit the exact bytes of your cached prefix across calls and eliminate any variable content.
A quick diagnostic: if cache_creation_input_tokens is consistently higher than cache_read_input_tokens in your usage reports, your cache isn't pooling. Either your prefix is too unstable, or your traffic doesn't justify caching.
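That diagnostic is easy to automate. A minimal sketch, with the usage dicts mirroring the field names above (the example records are made up):

```python
# Sum the two cache counters over a window of usage records and flag a
# cache that writes more than it reads.

def cache_is_pooling(usage_records: list) -> bool:
    created = sum(u.get("cache_creation_input_tokens", 0) for u in usage_records)
    read = sum(u.get("cache_read_input_tokens", 0) for u in usage_records)
    # A healthy cache reads far more than it writes.
    return read > created

# Fragmented cache: every call writes a fresh entry, nothing is reused.
bad = [{"cache_creation_input_tokens": 15000, "cache_read_input_tokens": 0}] * 5

# Pooled cache: one write, then hits.
good = ([{"cache_creation_input_tokens": 15000, "cache_read_input_tokens": 0}]
        + [{"cache_creation_input_tokens": 0, "cache_read_input_tokens": 15000}] * 4)

print(cache_is_pooling(bad), cache_is_pooling(good))   # False True
```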
Very short prefixes
Below the minimum (1024 tokens for Sonnet) the cache silently does nothing. The call goes through as uncached, no error, you just don't see hits. If you configured caching but logs show zero hits, check the token count of the cached portion first.
A mental model
Before adding cache_control anywhere, I ask:
- Is this prefix used at least twice within 5 minutes? If no, skip caching.
- Is the prefix byte-identical between calls? If no, refactor before caching.
- Is it over the minimum token threshold? If no, the cache does nothing.
- Is the variable part after the cached part? If not, fix the prompt structure.
Only after yes/yes/yes/yes do I add cache_control: { type: "ephemeral" } to the block.
Measuring whether it actually works
Anthropic returns three fields in every response's usage object that matter:
- cache_creation_input_tokens — tokens you wrote to cache (charged at 1.25×)
- cache_read_input_tokens — tokens read from cache (charged at 0.1×)
- input_tokens — uncached tokens (charged at 1.0×)
Your cache hit rate is roughly cache_read / (cache_read + cache_creation). A healthy production integration sits above 0.80. Below 0.50, something is off — either your prefix isn't stable, or traffic is too low for the TTL window.
I log these three fields on every call and chart them daily. When the ratio drops, it's usually because someone added a variable into the cached region (version strings, timestamps, A/B test flags). The fix is always "move the variable part further down the prompt".
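A minimal sketch of that logging, with the usage object mocked as plain dicts (real code would read response.usage from the SDK):

```python
# Record the three usage fields per call and compute the hit rate over a window.

def hit_rate(records: list) -> float:
    created = sum(r["cache_creation_input_tokens"] for r in records)
    read = sum(r["cache_read_input_tokens"] for r in records)
    total = created + read
    return read / total if total else 0.0

log = []
# Turn 1 of a 5-turn conversation: cache write on a 15K-token prefix.
log.append({"input_tokens": 300,
            "cache_creation_input_tokens": 15000,
            "cache_read_input_tokens": 0})
# Turns 2-5: cache hits on the same prefix.
for _ in range(4):
    log.append({"input_tokens": 300,
                "cache_creation_input_tokens": 0,
                "cache_read_input_tokens": 15000})

print(hit_rate(log))   # 0.8 (a healthy ratio for a 5-turn session)
```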
Five-minute TTL is shorter than you think
The TTL only extends on hits to that specific cache entry, not on writes of new entries. For a low-traffic service where users arrive sporadically, the cache keeps expiring between conversations.
Real example: an internal tool with 40 queries per hour across 8 users meant each user had maybe one call every 12 minutes. TTL expired between every pair of calls. The integration wasn't hitting cache — it was constantly rewriting it at 1.25×.
The fix depends on your context:
- If traffic is variable: Anthropic's optional 1-hour cache TTL (at a higher cache-write price than the 5-minute default) extends the window
- If the prefix is universal across users: the cache pools across all of them, so even low per-user volume gets cache hits
- If neither applies: caching isn't paying off and you should accept uncached calls
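A back-of-envelope way to sanity-check the TTL math for your own traffic (this assumes evenly spaced calls, which real traffic isn't, so treat it as a rough guide):

```python
# The 5-minute TTL refreshes on each hit, so what matters is the gap
# between consecutive calls that share the same cached prefix.

TTL_SECONDS = 5 * 60

def cache_stays_warm(calls_per_hour: float) -> bool:
    mean_gap = 3600 / calls_per_hour
    return mean_gap < TTL_SECONDS

# Per-user prefix, ~5 calls/hour each: 12-minute gaps, cache expires every time.
print(cache_stays_warm(5))    # False
# One prefix pooled across all 8 users, 40 calls/hour: 90-second gaps.
print(cache_stays_warm(40))   # True
```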
Prompt caching is not a substitute for prompt discipline
I keep seeing teams reach for caching to paper over a bloated prompt. You cached 20K tokens of system prompt? Great — but half of that 20K is probably dead instructions the model never needed. Caching the bloat is cheaper than paying full price for the bloat, but both are worse than deleting the bloat.
Order of operations:
- Shrink the prompt — cut unused tools, merge redundant rules, move examples into retrieval
- Structure for cacheability — stable prefix first, variables after
- Then add caching to the stable prefix
Inverting this order is how teams end up with a 40K-token system prompt that caches well but produces worse outputs than a 4K-token version would.
TL;DR
Prompt caching is a high-leverage optimization when:
- Same prefix repeats 2+ times within 5 minutes
- Prefix is byte-identical between calls
- Prefix is over 1024 tokens
- Variable content lives after the cached block
- You're actually measuring cache_read_input_tokens in usage data
It's actively harmful when:
- Calls are stateless one-shots
- Per-user or per-request variables live in the cached region
- Traffic is too low to hit within TTL
- You're using it to paper over a bloated prompt that should be shrunk first
Start by measuring your current cache hit rate before adding more cache blocks. That one number tells you whether the optimization is working.
I'm Ignacio Belando, a freelance senior engineer building Claude and multi-provider LLM integrations for startups and enterprises. If you want an audit of your current Claude integration or help designing one from scratch, email me or see the Claude API integration service page.