
💰 Optimize Cost in LLM Applications

Stop watching your LLM bill scale linearly with traffic. By the end, you'll be able to take any feature, name three cost cuts with dollar estimates, and defend the tradeoffs to your team.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: The Five Cost Levers Hiding in Plain Sight

See the 60-300x price gap most teams ignore

4 drops
  1. Haiku is roughly 60x cheaper than Opus for the same call

    6 min

    Within one model family, the cheap tier costs 60-300x less per token than the flagship; for many calls, the answer quality is indistinguishable.

  2. Route by complexity, not by convenience

    6 min

    A two-line router that sends easy calls to Haiku and hard calls to Sonnet typically cuts model spend 70-90% with zero quality loss on the easy path (a minimal sketch follows this phase's drop list).

  3. Prompt caching turns repeated context into a 90% discount

    7 min

    Prompt caching on Anthropic and OpenAI charges full price once (Anthropic adds a small write premium), then 10% (Anthropic) or 50% (OpenAI) of the input price for every cache hit on the same prefix; a cache-breakpoint sketch follows this list.

  4. Long context windows are a billing trap, not a feature

    6 min

    Stuffing 200K tokens into a context window every call is the most expensive way to give a model knowledge; RAG over the same content costs 10-100x less.
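
One way to picture the router from drop 2: a minimal sketch, assuming a crude length-and-keyword heuristic for "complexity" and illustrative Anthropic model names. The heuristic, thresholds, and aliases are placeholders, not the course's prescribed implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CHEAP_MODEL = "claude-3-5-haiku-latest"      # illustrative aliases; substitute whatever
FLAGSHIP_MODEL = "claude-sonnet-4-20250514"  # tiers your provider actually offers

def looks_hard(prompt: str) -> bool:
    """Crude complexity heuristic: long prompts or 'reasoning' keywords go to the big model."""
    keywords = ("analyze", "multi-step", "prove", "refactor", "debug")
    return len(prompt) > 2000 or any(k in prompt.lower() for k in keywords)

def route(prompt: str) -> str:
    model = FLAGSHIP_MODEL if looks_hard(prompt) else CHEAP_MODEL
    response = client.messages.create(
        model=model,
        max_tokens=512,  # the output cap doubles as a per-call cost ceiling
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```

The core decision really is the two lines in `route`; everything else is plumbing you already have.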
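
And a sketch of the caching mechanics from drop 3, using an Anthropic-style cache_control breakpoint on a long, stable system prefix. Field names follow the Anthropic SDK as of this writing, so check current docs before relying on them; OpenAI's caching is applied automatically and needs no markup.

```python
import anthropic

client = anthropic.Anthropic()

# Load the big, stable prefix you reuse verbatim on every call (manual, schema, house rules).
LONG_STABLE_PREFIX = "...thousands of tokens of product docs and instructions..."

def ask(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LONG_STABLE_PREFIX,
                # Everything up to this breakpoint is written to the cache on the first
                # call, then billed at roughly 10% of the input price on every hit
                # within the cache window.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],  # only this part changes per call
    )
    return response.content[0].text
```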

Phase 2: Tracing Tokens to Dollars

Trace tokens to dollars and predict cache savings

5 drops
  1. Estimate any call's cost in 30 seconds with one formula

    7 min

    Cost per call equals (input_tokens × input_price + output_tokens × output_price) ÷ 1,000,000, with prices quoted per million tokens. That's the whole formula, and most teams have never written it down (it's worked through in the sketch after this list).

  2. Predict caching savings before you ship the cache

    7 min

    Caching savings ≈ cacheable_tokens × hits_per_window × input_price × 0.9 ÷ 1,000,000, using the same per-million price convention and Anthropic's 90% hit discount. Three numbers tell you whether to bother.

  3. Compress prompts before you cache them

    6 min

    Most production prompts are 30-50% padding (boilerplate instructions, redundant examples, polite filler) that costs real money on every call.

  4. Cap output tokens to cap your bill

    6 min

    max_tokens is the simplest cost control you have, and it's the one most teams forget to set, leaving the model free to generate a 4,000-token response when 200 would do.

  5. Every cost cut needs an eval to defend it

    7 min

    A 70% cost cut that breaks 5% of outputs is a regression, not a win. Without an eval, you can't tell the difference until customers complain (a minimal eval-gate sketch follows this list).
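
The formulas from drops 1, 2, and 4 fit in a few lines of Python. The prices below are placeholders, not quotes from any price sheet, and the cache-hit discount assumes the 90% Anthropic figure from Phase 1.

```python
# Back-of-envelope cost math. Prices are per million tokens (placeholders, not real quotes).
INPUT_PRICE = 0.80    # $ per 1M input tokens
OUTPUT_PRICE = 4.00   # $ per 1M output tokens

def cost_per_call(input_tokens: int, output_tokens: int) -> float:
    """Drop 1: (input_tokens * input_price + output_tokens * output_price) / 1,000,000."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

def cache_savings(cacheable_tokens: int, hits_per_window: int, discount: float = 0.9) -> float:
    """Drop 2: tokens you no longer pay full price for, times the hit discount (~90% on Anthropic)."""
    return cacheable_tokens * hits_per_window * INPUT_PRICE * discount / 1_000_000

def worst_case_output_cost(max_tokens: int) -> float:
    """Drop 4: max_tokens bounds the output term, so it also bounds the worst-case bill per call."""
    return max_tokens * OUTPUT_PRICE / 1_000_000

# Example: a 6,000-token prompt with a 300-token answer, cached and capped.
print(f"per call:       ${cost_per_call(6_000, 300):.5f}")
print(f"cache savings:  ${cache_savings(5_000, 200):.4f} per cache window")
print(f"worst-case out: ${worst_case_output_cost(512):.5f} per call")
```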
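
Drop 5's point can be made concrete with a tiny regression gate: run a golden set through the old and new configurations and refuse the cost cut if quality drops more than a threshold. Everything below is a self-contained sketch; the `answer_old`/`answer_new` callables and the substring judge stand in for your real model calls and your real eval.

```python
GOLDEN_SET = [
    {"question": "What plan includes SSO?", "must_contain": "Enterprise"},
    {"question": "What's the refund window?", "must_contain": "30 days"},
]

def judge(expected: str, output: str) -> bool:
    """Toy eval: real evals use rubric graders or labeled comparisons."""
    return expected.lower() in output.lower()

def pass_rate(answer_fn, cases) -> float:
    return sum(judge(c["must_contain"], answer_fn(c["question"])) for c in cases) / len(cases)

def accept_cost_cut(answer_old, answer_new, cases, max_drop: float = 0.01) -> bool:
    """Ship the cheaper config only if quality drops by at most max_drop."""
    return pass_rate(answer_old, cases) - pass_rate(answer_new, cases) <= max_drop

# Stand-in model calls so the sketch runs on its own.
old = lambda q: "The Enterprise plan includes SSO; refunds within 30 days."
new = lambda q: "Enterprise has SSO. Refunds: 30 days."
print(accept_cost_cut(old, new, GOLDEN_SET))  # True: this cheaper config didn't break the golden set
```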

Phase 3: Choosing the Right Lever per Workload

Choose batch, cache, or RAG for real workloads

4 drops
  1. Your nightly job is paying real-time prices for no reason

    7 min

    Batch APIs process the same requests asynchronously at roughly half the real-time price; if a job can wait until morning, it shouldn't pay interactive rates.

  2. Half your customer support questions are basically the same question

    7 min

    Deduplicating repeated questions is the cheapest lever of all: answer once, serve the stored response to every near-identical duplicate, and skip the model call entirely (a minimal sketch follows this list).

  3. Your codebase Q&A bot is loading the entire repo every call

    8 min

    Retrieving only the files relevant to each question, instead of re-sending the whole repo, turns a 200K-token context into a few thousand tokens per call, in line with the 10-100x RAG savings from Phase 1.

  4. Three levers on one feature compound to a 95% cost cut

    8 min

    Cost cuts multiply rather than add: each lever applies to whatever spend the previous one left, so routing, caching, and output caps on one feature compound toward the headline 95% (the arithmetic follows this list).
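
A minimal sketch of the dedup idea in drop 2: normalize the question, hash it, and serve a stored answer before ever calling a model. The in-memory dict stands in for whatever store you actually use, and `call_model` is a placeholder for your real (billable) completion call.

```python
import hashlib

_answer_cache: dict[str, str] = {}  # stand-in for Redis, a DB table, etc.

def call_model(question: str) -> str:
    """Placeholder for the real model call."""
    return f"(model answer to: {question})"

def normalize(question: str) -> str:
    return " ".join(question.lower().split())

def answer(question: str) -> str:
    key = hashlib.sha256(normalize(question).encode()).hexdigest()
    if key not in _answer_cache:                    # only near-identical repeats are free;
        _answer_cache[key] = call_model(question)   # paraphrases need an embedding lookup instead
    return _answer_cache[key]

print(answer("How do I reset my password?"))
print(answer("how do I reset my password?  "))  # cache hit: no second model call
```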
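
The compounding claim in drop 4 is just multiplication: each lever cuts a fraction of whatever spend the previous lever left behind. The percentages below are illustrative, not the course's case study.

```python
# Illustrative lever sizes: routing saves 70% of spend, caching 60% of what's left,
# output caps 50% of what's left after that.
levers = {"routing": 0.70, "caching": 0.60, "output caps": 0.50}

remaining = 1.0
for name, cut in levers.items():
    remaining *= (1 - cut)
    print(f"after {name:<11}: {remaining:6.1%} of the original bill remains")

print(f"total cut: {1 - remaining:.0%}")  # three moderate cuts compound to ~94%
```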

Phase 4: Auditing a Real Feature

Audit a real feature and propose three cost cuts

1 drop
  1. Audit a real feature and propose three cost cuts with measured tradeoffs

    20 min

    The capstone: trace one real feature's token flow, estimate its per-call cost with the Phase 2 formula, and propose three cuts, each with a dollar estimate and an eval to defend the tradeoff.

Frequently asked questions

How much cheaper is Haiku than Opus per token?
Roughly 60x per token, and anywhere from 60-300x across a model family's tiers. The path's Phase 1 micro-lessons cover when the quality gap matters and when it doesn't.
Does prompt caching actually save money or just latency?
Both. Cache hits are billed at about 10% of the input price on Anthropic and 50% on OpenAI, so the savings are real money, not just latency. Phases 1 and 2 show how to predict the discount before you ship the cache.
When is the Anthropic batch API worth using?
When the work doesn't need an answer in real time: nightly jobs, backfills, bulk classification. Batch pricing is typically about half the interactive rate, and Phase 3 applies it to a real workload.
Should I use RAG or just stuff everything into a long context window?
For repeated calls over the same content, RAG typically costs 10-100x less than re-sending a long context window every call. Phases 1 and 3 cover when long context is a billing trap and when it's the right tool.
How do I estimate token cost before shipping a feature?
Cost per call = (input_tokens × input_price + output_tokens × output_price) ÷ 1,000,000, with prices quoted per million tokens. Phase 2 turns that formula into a 30-second estimate plus a caching-savings prediction.