💰 Optimize Cost in LLM Applications
Stop watching your LLM bill scale linearly with traffic. By the end, you'll be able to take any feature, name three cost cuts with dollar estimates, and defend the tradeoffs to your team.
Phase 1: The Five Cost Levers Hiding in Plain Sight
See the 60-300x price gap most teams ignore
Haiku is roughly 60x cheaper than Opus for the same call
6 min · Within one model family, the cheap tier costs 60-300x less per token than the flagship, and for many calls the answer quality is indistinguishable.
Route by complexity, not by convenience
6 min · A two-line router that sends easy calls to Haiku and hard calls to Sonnet typically cuts model spend 70-90% with zero quality loss on the easy path.
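A minimal sketch of the idea in Python, assuming a crude length-plus-keyword heuristic as the complexity check; the model IDs and threshold are placeholders to swap for your own classifier and current model names:

```python
# Route by complexity: cheap model for easy calls, strong model for hard ones.
# The heuristic, threshold, and model IDs below are illustrative placeholders.
def pick_model(prompt: str) -> str:
    looks_hard = len(prompt) > 2_000 or "step by step" in prompt.lower()
    return "claude-sonnet-4-5" if looks_hard else "claude-3-5-haiku-latest"
```

The heuristic matters less than having a routing point at all: once every call goes through `pick_model`, upgrading the check to a trained classifier is a one-function change.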
Prompt caching turns repeated context into a 90% discount
7 min · Anthropic and OpenAI prompt caching charge full input price once (Anthropic adds a 25% write premium), then 10% (Anthropic) or 50% (OpenAI) of that price for every cache hit on the same prefix.
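A sketch of opt-in caching using the Anthropic Python SDK's `cache_control` block as documented at the time of writing; `LONG_STABLE_INSTRUCTIONS` is a stand-in for your reused prefix, and the model ID is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_STABLE_INSTRUCTIONS = "..."  # imagine several thousand tokens of reused context

# Mark the big, stable prefix as cacheable. The first call pays the write
# premium; later calls with the identical prefix bill at the cache-read rate.
response = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder model ID
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "First question about the docs"}],
)
```

Caching only pays off when the prefix is byte-identical across calls, so keep anything per-user or per-request out of the cached block.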
Long context windows are a billing trap, not a feature
6 min · Stuffing 200K tokens into the context window on every call is the most expensive way to give a model knowledge; RAG over the same content costs 10-100x less.
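Back-of-envelope arithmetic, assuming an illustrative $3 per million input tokens:

```python
# Full-context stuffing vs. RAG at an assumed $3/M input price.
price_per_mtok = 3.00
stuffed = 200_000 * price_per_mtok / 1_000_000  # $0.60 per call
rag = 2_000 * price_per_mtok / 1_000_000        # $0.006 per call
print(f"stuffed ${stuffed:.3f}, rag ${rag:.4f}, ratio {stuffed / rag:.0f}x")
```

At a retrieved slice of 2K tokens, that's the 100x end of the range; fatter retrievals land closer to 10x.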
Phase 2: Tracing Tokens to Dollars
Trace tokens to dollars and predict cache savings
Estimate any call's cost in 30 seconds with one formula
7 min · Cost per call equals (input_tokens × input_price + output_tokens × output_price) ÷ 1,000,000, with prices quoted in USD per million tokens. That's the whole formula, and most teams have never written it down.
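The same formula as a Python helper; the example prices are assumptions, not a current price sheet:

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Dollar cost of one call; prices are USD per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 5K tokens in, 800 out, at assumed $3/M input and $15/M output prices:
print(cost_per_call(5_000, 800, 3.00, 15.00))  # 0.027 -> about 2.7 cents
```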
Predict caching savings before you ship the cache
7 min · Caching savings = cacheable_tokens × hits_per_window × (input_price × 0.9). Three numbers tell you whether to bother.
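The same estimate as a helper, kept consistent with the cost formula above; it assumes Anthropic's roughly 90% cache-read discount and ignores the one-time write premium, so treat the result as an upper bound:

```python
def cache_savings_per_window(cacheable_tokens: int, hits_per_window: int,
                             input_price: float) -> float:
    """Dollars saved per cache window; input_price is USD per million tokens.
    Assumes a 90% discount on cache hits; ignores the cache-write premium."""
    return cacheable_tokens * hits_per_window * (input_price * 0.9) / 1_000_000

# An 8K-token prefix hit 50 times per window, at an assumed $3/M input price:
print(cache_savings_per_window(8_000, 50, 3.00))  # 1.08 -> about $1.08 per window
```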
Compress prompts before you cache them
6 min · Most production prompts are 30-50% padding: boilerplate instructions, redundant examples, and polite filler that costs real money on every call.
Cap output tokens to cap your bill
6 min · max_tokens is the simplest cost control you have, and it's the one most teams forget to set, leaving the model free to generate a 4,000-token response when 200 would do.
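Setting the cap is one argument; the sketch below assumes a summary endpoint where ~200 tokens is plenty (model ID is a placeholder):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-haiku-latest",  # placeholder model ID
    max_tokens=200,  # hard ceiling on billed output tokens for this endpoint
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
```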
Every cost cut needs an eval to defend it
7 min · A 70% cost cut that breaks 5% of outputs is a regression, not a win. Without an eval, you can't tell the difference until customers complain.
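One way to make that concrete is a pass-rate gate over a golden set; `golden_cases` and `run_cheap_path` here are hypothetical stand-ins for your own fixtures and the cheaper code path under test:

```python
def passes_eval(golden_cases, run_cheap_path, required_pass_rate=0.98):
    """Gate a cost cut: ship only if the cheap path clears the golden set.

    golden_cases: list of {"input": ..., "check": callable} fixtures (hypothetical).
    run_cheap_path: the cheaper pipeline under test (hypothetical).
    """
    passed = sum(
        1 for case in golden_cases if case["check"](run_cheap_path(case["input"]))
    )
    return passed / len(golden_cases) >= required_pass_rate
```

Run the same gate on the expensive path first; if it only scores 97%, a 98% bar would block the cut for the wrong reason.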
Phase 3: Choosing the Right Lever per Workload
Choose batch, cache, or RAG for real workloads
Your nightly job is paying real-time prices for no reason
7 min · The Anthropic and OpenAI batch APIs price tokens at roughly 50% off for jobs that can wait; anything running on a schedule instead of in front of a user qualifies.
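A sketch against the Anthropic Message Batches API as documented at the time of writing; `nightly_docs` is a hypothetical stand-in for your queued inputs, and the request shape should be verified against the current SDK:

```python
import anthropic

client = anthropic.Anthropic()
nightly_docs = ["first document text...", "second document text..."]  # placeholder inputs

# Queue the whole nightly job at batch rates (roughly 50% off); results
# come back asynchronously rather than per-request.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-3-5-haiku-latest",  # placeholder model ID
                "max_tokens": 300,
                "messages": [{"role": "user", "content": f"Summarize:\n{doc}"}],
            },
        }
        for i, doc in enumerate(nightly_docs)
    ]
)
print(batch.id)  # poll for results later; nothing here needs a real-time reply
```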
Half your customer support questions are basically the same question
7 min · When the same questions arrive again and again, caching the shared prompt prefix (or the answer itself) turns your most common tickets into near-free responses.
Your codebase Q&A bot is loading the entire repo every call
8 min · Indexing the repo and retrieving only the files relevant to each question cuts input tokens 10-100x compared with loading the whole codebase into context.
Three levers on one feature compound to a 95% cost cut
8 min · Cuts multiply rather than add: routing, caching, and batching each scale what's left of the bill, so three moderate levers compound into roughly a 95% reduction.
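The compounding is plain multiplication; the ratios below are assumptions to illustrate the shape, not measurements:

```python
# Each lever scales what's left of the bill, so cuts multiply rather than add.
routing = 0.30    # assumed: cheap-model routing keeps 30% of spend
caching = 0.30    # assumed: prompt caching keeps 30% of the remainder
batching = 0.50   # assumed: batch pricing keeps 50% of what's left
remaining = routing * caching * batching  # 0.045
print(f"total cut: {1 - remaining:.1%}")  # total cut: 95.5%
```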
Phase 4: Auditing a Real Feature
Audit a real feature and propose three cost cuts
Audit a real feature and propose three cost cuts with measured tradeoffs
20 min · Capstone: trace one production feature's tokens to dollars, pick three levers, put a dollar estimate on each, and write the eval that defends the tradeoffs.
Frequently asked questions
- How much cheaper is Haiku than Opus per token?
- Roughly 60x per token, and across a model family the gap between the cheap tier and the flagship can run 60-300x. For many calls the cheap tier's answers are indistinguishable.
- Does prompt caching actually save money or just latency?
- Both. Anthropic bills cache hits at 10% of the normal input price and OpenAI at 50%, so a reused prefix saves real money on every hit, not just round-trip time.
- When is the Anthropic batch API worth using?
- Whenever the job doesn't need a real-time answer. Batch pricing runs about 50% off standard rates, so nightly summarization, backfills, and scheduled reports all qualify.
- Should I use RAG or just stuff everything into a long context window?
- If the same large corpus goes into the context window on every call, RAG over that content typically costs 10-100x less; save long context for genuinely one-off material.
- How do I estimate token cost before shipping a feature?
- Multiply expected input and output tokens by their per-million-token prices: cost per call = (input_tokens × input_price + output_tokens × output_price) ÷ 1,000,000.
Related paths
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking, then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic, one failing snippet at a time, until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️ Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
📈 Big O Intuition
Stop treating Big O as math you memorized for an interview: build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.