Learn LLM Observability Fundamentals
Stop finding out about LLM regressions from angry user emails. By the end you'll know what to log on every call, which tools fit which signal, and how to sketch one dashboard an on-call engineer can read at 3am.
Phase 1: Why Generic APM Misses LLM-Shaped Problems
See why generic APM misses LLM-shaped problems
Your APM is green and your LLM feature is broken
6 min · Traditional APM tracks latency and errors; LLMs fail silently with bad outputs while every request returns 200 OK.
Logs, metrics, and traces, but the units are different
6 min · The three-pillars model still applies; you just swap the underlying signals from HTTP/SQL to prompt, response, tokens, cost, latency, and eval score.
Three failure modes APM was never built to see
7 min · Silent quality regressions, prompt drift, and cost spikes are the three failure modes unique to LLM systems; none of them shows up on a generic APM dashboard.
Six signals to log on every LLM call
6 min · Prompt (with version), response, tokens in/out, cost, latency, and a quality signal: capture these on every call and 80% of LLM observability falls into place.
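One way to make the six signals concrete is a single structured record per call. A minimal Python sketch, assuming you simply emit JSON to your existing logging pipeline; the field names and the `log_llm_call` helper are illustrative, not a fixed schema:

```python
import json
import time
from dataclasses import dataclass, asdict, field

@dataclass
class LLMCallRecord:
    prompt_version: str    # e.g. a git hash or semver tag of the prompt template
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float
    quality_signal: str    # e.g. "json_valid=true" or "judge_score=0.8"
    ts: float = field(default_factory=time.time)

def log_llm_call(record: LLMCallRecord) -> None:
    # In production this would go to your tracing/logging backend, not stdout.
    print(json.dumps(asdict(record)))

log_llm_call(LLMCallRecord(
    prompt_version="v12", prompt="Summarize the ticket.", response="A short summary.",
    input_tokens=240, output_tokens=60, cost_usd=0.00021, latency_ms=820.0,
    quality_signal="length_ok=true",
))
```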
Phase 2: Wiring One Endpoint End to End
Wire prompt, tokens, cost, latency, and quality into one trace
A trace is a tree of spans, and your LLM call is a tree
6 min · OpenTelemetry's span-and-attribute model maps cleanly onto an LLM pipeline; each step (retrieve, rerank, model, tool, parse) is a span with structured attributes.
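A minimal sketch of that span tree with the OpenTelemetry Python SDK, assuming the `opentelemetry-sdk` package is installed; the span names, attribute key, and console exporter are illustrative choices, not a required convention:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export finished spans to stdout so the tree is visible without any backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-pipeline")

with tracer.start_as_current_span("answer_question"):          # root: one user request
    with tracer.start_as_current_span("retrieve"):             # vector search
        pass
    with tracer.start_as_current_span("rerank"):                # optional reranker
        pass
    with tracer.start_as_current_span("model_call") as span:   # the LLM itself
        span.set_attribute("llm.model", "example-model")        # illustrative attribute name
    with tracer.start_as_current_span("parse_output"):          # format/JSON validation
        pass
```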
Wrap one call site in a span and stop guessing
7 min · The minimum viable instrumentation is one span around the model call, with prompt hash, model name, and token counts as attributes; that single span unlocks 70% of debugging.
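A sketch of that minimum viable wrap, reusing the tracer setup from the sketch above; `call_model` is a stand-in for whatever client you actually use, and the attribute names are illustrative:

```python
import hashlib
from opentelemetry import trace

# Assumes a TracerProvider is configured as in the previous sketch;
# without one, the spans are no-ops but the code still runs.
tracer = trace.get_tracer("llm-demo")

def call_model(prompt: str) -> dict:
    # Stand-in for your real client call; returns text plus token usage.
    return {"text": "stub response", "input_tokens": 120, "output_tokens": 45}

def answer(prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model", "example-model")
        span.set_attribute("llm.prompt_hash", hashlib.sha256(prompt.encode()).hexdigest()[:12])
        result = call_model(prompt)
        span.set_attribute("llm.tokens.input", result["input_tokens"])
        span.set_attribute("llm.tokens.output", result["output_tokens"])
        return result["text"]

print(answer("Summarize the incident report in two sentences."))
```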
Cost is a function of tokens: log both, track the ratio
7 min · Cost in dollars is derived; tokens in/out are the primary signal. Track the ratio of output to input tokens: it's the leading indicator of cost surprises.
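A small sketch of the derivation, with placeholder prices; substitute your provider's real price sheet and the model names you actually run:

```python
# Placeholder per-1K-token prices; swap in your provider's actual price sheet.
PRICE_PER_1K = {"example-model": {"input": 0.0005, "output": 0.0015}}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    price = PRICE_PER_1K[model]
    cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
    # The output/input ratio is the leading indicator: a prompt tweak that makes
    # the model ramble shows up here before it shows up on the invoice.
    return {
        "cost_usd": round(cost, 6),
        "out_in_ratio": output_tokens / max(input_tokens, 1),
    }

print(call_cost("example-model", input_tokens=1200, output_tokens=300))
```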
Time-to-first-token is the latency that users feel
6 min · For streaming UIs, time-to-first-token (TTFT) matters more than total latency; users tolerate long answers if the first word arrives fast.
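A sketch of measuring TTFT against any streaming response, assuming the client exposes the response as an iterator of chunks; `fake_stream` is a stand-in for a real streaming call:

```python
import time
from typing import Iterable, Tuple

def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a streaming response, returning (text, ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start   # first chunk arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return "".join(parts), ttft, total

# Stand-in for a real streaming client: a generator that yields chunks slowly.
def fake_stream():
    for word in ["Hello", " world", "!"]:
        time.sleep(0.05)
        yield word

text, ttft, total = measure_stream(fake_stream())
print(f"ttft={ttft:.3f}s total={total:.3f}s")
```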
Online evals: the cheap version is a rule, not a model
7 min · An online eval doesn't have to be a judge LLM: a regex, a JSON-schema check, or a length-bound rule already gives you a quality signal you can chart over time.
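A sketch of such a rule-based online eval; the specific checks (length bound, JSON validity, a hypothetical `answer` field) are illustrative and should mirror your own output contract:

```python
import json
import re

def rule_eval(response: str) -> dict:
    """Cheap online eval: named pass/fail checks you can chart over time."""
    checks = {
        "non_empty": bool(response.strip()),
        "length_ok": len(response) < 4000,  # length-bound rule
        "no_repetition": not re.search(r"(\b\w+\b)( \1){5,}", response),  # crude loop detector
    }
    try:
        parsed = json.loads(response)
        checks["json_valid"] = isinstance(parsed, dict)
        checks["has_answer_field"] = isinstance(parsed, dict) and "answer" in parsed
    except json.JSONDecodeError:
        checks["json_valid"] = False
        checks["has_answer_field"] = False
    checks["overall_pass"] = all(checks.values())
    return checks

print(rule_eval('{"answer": "42"}'))
```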
Phase 3: Mapping the Tool Landscape onto the Pillars
Map Langfuse, LangSmith, Helicone, and Phoenix onto the same pillars
Your CTO asks: Langfuse, LangSmith, Helicone, or Phoenix?
7 min · Each tool emphasizes a different pillar: Langfuse on traces+evals, LangSmith on dev-loop debugging, Helicone on a logging proxy, Phoenix on offline analysis. Map them to the pillar you're weakest in.
A prompt change shipped Monday; quality dropped Friday
7 min · Prompt drift is a versioning problem; the same observability stack that catches it also explains it once prompts are versioned and joined to traces.
The bill jumped 40% but traffic is flat
7 min · When traffic is flat and cost spikes, the bug is in the per-call shape: usually output tokens, retry counts, or a quietly larger context window.
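A sketch of that diagnosis with pandas, assuming you already log per-call token counts and retries; the records below are made-up illustrations of the shape change:

```python
import pandas as pd

# Illustrative per-call records; in practice these come from your trace/log store.
calls = pd.DataFrame([
    {"day": "2024-06-03", "input_tokens": 900,  "output_tokens": 150, "retries": 0},
    {"day": "2024-06-03", "input_tokens": 950,  "output_tokens": 160, "retries": 0},
    {"day": "2024-06-10", "input_tokens": 2400, "output_tokens": 480, "retries": 1},
    {"day": "2024-06-10", "input_tokens": 2600, "output_tokens": 510, "retries": 2},
])

# Same traffic, very different per-call shape: context size, output length, and
# retries all moved, which is where a flat-traffic cost spike usually hides.
shape = calls.groupby("day").agg(
    call_count=("input_tokens", "size"),
    avg_context=("input_tokens", "mean"),
    avg_output=("output_tokens", "mean"),
    avg_retries=("retries", "mean"),
)
print(shape)
```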
Your stack already has Datadog. Where do LLM signals belong?
7 min · Most LLM-specific tools (Langfuse, LangSmith) run alongside, not instead of, your existing APM: they're the right home for prompts, evals, and per-call inspection; APM stays the home for cross-service traces and infra metrics.
Phase 4: Sketching Your 3am Dashboard
Sketch a 3am dashboard for an LLM feature you actually ship
Sketch the dashboard an on-call engineer reads at 3am
12 min · Combine the six signals from Phase 1 into a single dashboard an on-call engineer can read at 3am.
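One possible shape for that dashboard, expressed as a panel list; the metric names and thresholds are placeholders to adapt to whatever backend you chart them in:

```python
# Illustrative 3am-dashboard panels; metric names and thresholds are placeholders.
DASHBOARD = [
    {"panel": "Error + fallback rate",    "metric": "llm.errors / llm.calls",            "alert_if": "> 2%"},
    {"panel": "Time to first token",      "metric": "p95(llm.ttft_ms)",                  "alert_if": "> 1500 ms"},
    {"panel": "Output/input token ratio", "metric": "avg(output_tokens / input_tokens)", "alert_if": "drifts > 30% from baseline"},
    {"panel": "Cost per call",            "metric": "sum(cost_usd) / llm.calls",         "alert_if": "drifts > 30% from baseline"},
    {"panel": "Online eval pass rate",    "metric": "passed_checks / llm.calls",         "alert_if": "< 95%"},
    {"panel": "Prompt version in prod",   "metric": "count by llm.prompt_version",       "alert_if": "unexpected new version"},
]

for p in DASHBOARD:
    print(f'{p["panel"]:28} {p["metric"]:44} alert if {p["alert_if"]}')
```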
Frequently asked questions
- What is LLM observability and how is it different from regular APM?
- LLM observability keeps the three pillars (logs, metrics, traces) but swaps the signals to prompt, response, tokens, cost, latency, and eval score. Generic APM tracks latency and errors, while LLM features can fail silently with bad outputs that still return 200 OK; Phase 1 of this path covers the difference in detail.
- Why don't Datadog or New Relic give me what I need for LLM features?
- Generic APM was never built to surface silent quality regressions, prompt drift, or cost spikes. It remains the home for cross-service traces and infra metrics, while LLM-specific tools run alongside it for prompts, evals, and per-call inspection; Phases 1 and 3 cover this split.
- What should I log on every LLM call beyond prompt and response?
- Six signals: prompt (with its version), response, tokens in/out, cost, latency, and a quality signal. Capture these on every call and most of LLM observability falls into place; Phase 1 closes with exactly this checklist.
- How is Langfuse different from LangSmith, Helicone, and Phoenix?
- Each tool emphasizes a different pillar: Langfuse leans toward traces plus evals, LangSmith toward dev-loop debugging, Helicone toward a logging proxy, and Phoenix toward offline analysis. Phase 3 maps all four onto the same pillars so you can pick for the pillar you're weakest in.
- How do I detect a silent quality regression after I change a prompt?
- Version your prompts, join the version to traces, and chart a cheap online eval (a regex, JSON-schema check, or length bound) over time. A regression after a prompt change then shows up as a step in the eval chart instead of an angry user email; Phases 2 and 3 walk through this end to end.
Related paths
Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking, then ship a working caching or logging decorator from scratch in under 30 lines.
Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic, one failing snippet at a time, until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
Big O Intuition
Stop treating Big O as math you memorized for an interview; build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.