Cut Claude API costs 90% with prompt caching

The single optimization that pays for itself the first day.

Prompt caching lets Claude reuse a previously-processed prompt prefix at a fraction of the input cost. Here's how it works, when to use it, and the patterns that compound.

If you're building anything on the Claude API that sends the same long context repeatedly — a system prompt, a knowledge base, a few documents — you should be using prompt caching. It is the single biggest cost lever in the API and it takes about 30 seconds to enable.

What it actually does

When you mark part of your prompt as cacheable, Anthropic stores the processed prompt prefix on their side. The next time you send a request that starts with the same prefix, Claude reuses the cached version instead of re-processing it.

The economics:

  • Cache write (first request): ~25% more expensive than a normal input token.
  • Cache read (every subsequent hit within ~5 minutes): ~10% of normal input cost.

So if you send the same 10,000-token system prompt a hundred times, naive cost is 100× the input price. With caching, it's roughly 1.25× (the first write) plus 99 × 0.1× (the reads) = about 11× the input price. A 90% reduction.
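
To sanity-check the arithmetic (the 1.25× and 0.1× multipliers come from the pricing above; substitute your model's real per-token rate):

PROMPT_TOKENS = 10_000
REQUESTS = 100

naive = REQUESTS * PROMPT_TOKENS                        # 1,000,000 token-equivalents
cached = 1.25 * PROMPT_TOKENS + (REQUESTS - 1) * 0.1 * PROMPT_TOKENS

print(f"naive:  {naive:>9,.0f}")
print(f"cached: {cached:>9,.0f}")
print(f"saved:  {1 - cached / naive:.0%}")              # ~89%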

How to enable it

In the Anthropic SDK, you mark a content block as cacheable with cache_control:

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_question}]
)

That's it. The first call writes the cache; any call within the TTL whose prompt starts with the byte-identical prefix reads from it. The cache key is derived automatically from the content, so there's nothing to manage by hand.
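
To confirm it's working, send a follow-up within the TTL that reuses the same prefix and inspect the usage block. A sketch continuing the example above:

followup = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "A different question"}]
)

# On a hit, the prefix shows up under cache_read_input_tokens and
# input_tokens counts only the uncached remainder.
print(followup.usage.cache_read_input_tokens)
print(followup.usage.input_tokens)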

When the win is biggest

Caching pays off most when:

  • The same prefix gets sent many times in a short window. The cache TTL is around 5 minutes by default, and every read refreshes it; if your traffic is bursty (a chat session, a batch job), you hit it constantly.
  • The prefix is long. Caching a 200-token system prompt saves you nothing meaningful. Caching a 50,000-token document corpus saves you real money.
  • The prefix is stable. If you regenerate the prompt with a new timestamp on every call, you'll never get a hit (see the sketch below).
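
A minimal before/after on the stability point, where POLICY_DOC stands in for whatever stable content you're caching:

from datetime import datetime, timezone

# Cache-busting: a fresh timestamp makes the prefix unique per call.
bad_system_text = f"Current time: {datetime.now(timezone.utc)}\n\n{POLICY_DOC}"

# Cache-friendly: keep the prefix byte-identical; anything volatile
# goes after the breakpoint (e.g. in the user message).
good_system_text = POLICY_DOC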

Concrete patterns where it shines:

  • Customer support bots with a long policy/FAQ system prompt, hit by many users.
  • RAG systems where the retrieved documents are long and the user follows up multiple times in the same session.
  • Agent loops where the same tool definitions and instructions are sent on every step.
  • Batch processing of documents against a shared rubric.

Patterns that compound

1. Cache the boring middle. Put your stable content (system prompt, tool defs, retrieved docs) at the start of the prompt and cache it. Put the volatile content (user message, current state) after the cache breakpoint, where it doesn't invalidate anything.

2. Stack cache breakpoints. You can mark up to four points as cacheable. Use this for tiered staleness: system prompt (kept warm across hours of steady traffic), then knowledge base (cached per conversation), then conversation history (cached per turn). A sketch follows.
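
A sketch of the tiering, assuming history is a non-empty list of {role, content} dicts with string content; SYSTEM_PROMPT, KNOWLEDGE_BASE, and user_question are placeholders:

response = client.messages.create(
    model="claude-opus-4-1",
    max_tokens=1024,
    system=[
        # Tier 1: shared by everyone; stays warm as long as
        # traffic keeps hitting it.
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},
        # Tier 2: stable for the life of one conversation.
        {"type": "text", "text": KNOWLEDGE_BASE,
         "cache_control": {"type": "ephemeral"}},
    ],
    messages=[
        *history[:-1],
        # Tier 3: mark the end of the prior turns so each new turn
        # reads everything before it from cache.
        {"role": history[-1]["role"],
         "content": [{"type": "text", "text": history[-1]["content"],
                      "cache_control": {"type": "ephemeral"}}]},
        {"role": "user", "content": user_question},
    ],
)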

3. Watch the response. The API returns cache_creation_input_tokens and cache_read_input_tokens in the usage block. If you never see reads, your prefix is changing between calls; find out what's changing.

print(response.usage)
# Usage(input_tokens=120, cache_creation_input_tokens=0,
#       cache_read_input_tokens=8400, output_tokens=312)

4. Pair with the Batch API. If you're processing thousands of items against a shared prompt, the Batch API gives you another 50% off on top of caching. Compounding.
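
A sketch of the combination using the SDK's Message Batches interface (RUBRIC and DOCUMENTS are placeholders; cache hits across items in a batch aren't guaranteed):

# Every batch request shares the same cacheable rubric prefix.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-opus-4-1",
                "max_tokens": 1024,
                "system": [
                    {
                        "type": "text",
                        "text": RUBRIC,
                        "cache_control": {"type": "ephemeral"}
                    }
                ],
                "messages": [{"role": "user", "content": doc}]
            }
        }
        for i, doc in enumerate(DOCUMENTS)
    ]
)
print(batch.id, batch.processing_status)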

What it doesn't do

  • It doesn't cache outputs. Same prompt → still generates a fresh response. (If you want response-level caching, implement it yourself; see the sketch after this list.)
  • It doesn't help across different models. A cached prefix for Sonnet won't be reused by Opus.
  • It doesn't survive long idle gaps. If nothing hits the cache for ~5 minutes, it expires.
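
If you do want response-level caching, a dictionary keyed on a hash of the prompt is the minimal version. An illustrative sketch (in production you'd want eviction and a real store):

import hashlib

_responses: dict[str, str] = {}

def cached_completion(prompt: str) -> str:
    # Identical prompt -> return the stored completion, no API call.
    # Only sensible when a repeated answer is acceptable.
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _responses:
        resp = client.messages.create(
            model="claude-opus-4-1",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        _responses[key] = resp.content[0].text
    return _responses[key]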

Where to go next

  • Read the Anthropic prompt caching docs for the exact pricing table and SDK details.
  • For a deeper dive into Claude API patterns, browse the Anthropic Cookbook.
  • If you're building agents, pair caching with tool use — the tool definitions are exactly the kind of stable prefix that benefits.
