
🔭 Learn LLM Observability Fundamentals

Stop finding out about LLM regressions from angry user emails. By the end you'll know what to log on every call, which tools fit which signal, and how to sketch one dashboard an on-call engineer can read at 3am.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Generic APM Misses LLM-Shaped Problems

See why generic APM misses LLM-shaped problems

4 drops
  1. Your APM is green and your LLM feature is broken

    6 min

    Traditional APM tracks latency and errors; LLMs fail silently with bad outputs while every request returns 200 OK.

  2. Logs, metrics, traces, but the units are different

    6 min

    The three-pillars model still applies; you just swap the underlying signals from HTTP/SQL to prompt, response, tokens, cost, latency, and eval score.

  3. Three failure modes APM was never built to see

    7 min

    Silent quality regressions, prompt drift, and cost spikes are the three failure modes unique to LLM systems; none of them shows up on a generic APM dashboard.

  4. Six signals to log on every LLM call

    6 min

    Prompt (with version), response, tokens in/out, cost, latency, and a quality signal: capture these on every call and 80% of LLM observability falls into place.
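The six signals above fit in one structured log record per call. Here is a minimal sketch in Python; the field names and the `rough_quality_check` rule are illustrative assumptions, not any specific tool's schema:

```python
import hashlib
import json
import time


def rough_quality_check(response: str) -> float:
    """Hypothetical rule-based quality signal: 1.0 if the response
    is non-empty and within a sane length bound, else 0.0."""
    return 1.0 if 0 < len(response) <= 4000 else 0.0


def log_llm_call(prompt: str, prompt_version: str, response: str,
                 tokens_in: int, tokens_out: int,
                 cost_usd: float, latency_ms: float) -> dict:
    """Capture the six signals on every call: prompt (with version),
    response, tokens in/out, cost, latency, and a quality signal."""
    record = {
        "ts": time.time(),
        "prompt_version": prompt_version,
        # Hash instead of raw text if prompts may contain user data.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response": response,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        "quality": rough_quality_check(response),
    }
    print(json.dumps(record))  # ship to your log pipeline instead
    return record
```

Emitting one such record per call is enough to chart cost, latency, and quality over time, and the prompt version field is what later lets you join regressions back to prompt changes.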

Phase 2: Wiring One Endpoint End to End

Wire prompt, tokens, cost, latency, and quality into one trace

5 drops
  1. A trace is a tree of spans, and your LLM call is a tree

    6 min

    OpenTelemetry's span-and-attribute model maps cleanly onto an LLM pipeline: each step (retrieve, rerank, model, tool, parse) is a span with structured attributes.

  2. Wrap one call site in a span and stop guessing

    7 min

    The minimum viable instrumentation is one span around the model call, with prompt hash, model name, and token counts as attributes; that single span unlocks 70% of debugging.

  3. Cost is a function of tokens: log both, track the ratio

    7 min

    Cost in dollars is derived; tokens in/out are the primary signal. Track the ratio of output to input tokens; it's the leading indicator of cost surprises.

  4. Time-to-first-token is the latency that users feel

    6 min

    For streaming UIs, time-to-first-token (TTFT) matters more than total latency; users tolerate long answers if the first word arrives fast.

  5. Online evals: the cheap version is a rule, not a model

    7 min

    An online eval doesn't have to be a judge LLM: a regex, a JSON-schema check, or a length-bound rule already gives you a quality signal you can chart over time.
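A rule-based online eval of the kind drop 5 describes can be a few lines of stdlib Python. In this sketch the required keys and length bound are illustrative assumptions; the function returns 1 (pass) or 0 (fail) so results can be averaged into a quality time series:

```python
import json


def online_eval(response: str, required_keys=("answer", "sources"),
                max_chars: int = 2000) -> int:
    """Rule-based online eval: no judge LLM needed.
    Combines a length-bound rule with a JSON-shape check."""
    if len(response) > max_chars:       # length-bound rule
        return 0
    try:
        parsed = json.loads(response)   # does it parse at all?
    except json.JSONDecodeError:
        return 0
    if not isinstance(parsed, dict):    # must be an object
        return 0
    return int(all(k in parsed for k in required_keys))
```

Run it on every response and chart the rolling pass rate; a dip in that line is a quality regression you would otherwise only hear about from users.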

Phase 3: Mapping the Tool Landscape onto the Pillars

Map Langfuse, LangSmith, Helicone, and Phoenix onto the same pillars

4 drops
  1. Your CTO asks: Langfuse, LangSmith, Helicone, or Phoenix?

    7 min

    Each tool emphasizes a different pillar: Langfuse on traces and evals, LangSmith on dev-loop debugging, Helicone on a logging proxy, Phoenix on offline analysis. Map them to the pillar you're weakest in.

  2. A prompt change shipped Monday; quality dropped Friday

    7 min

    Prompt drift is a versioning problem; the same observability stack that catches it also explains it once prompts are versioned and joined to traces.

  3. The bill jumped 40% but traffic is flat

    7 min

    When traffic is flat and cost spikes, the bug is in the per-call shape: usually output tokens, retry counts, or a quietly larger context window.

  4. Your stack already has Datadog. Where do LLM signals belong?

    7 min

    Most LLM-specific tools (Langfuse, LangSmith) run alongside, not instead of, your existing APM: they're the right home for prompts, evals, and per-call inspection; APM stays the home for cross-service traces and infra metrics.
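The flat-traffic cost spike from drop 3 can be isolated by comparing the per-call shape across two time windows. A sketch, assuming simple per-call records whose field names are invented here for illustration:

```python
def per_call_shape(calls: list[dict]) -> dict:
    """Average the per-call factors that drive cost when
    request volume is flat: output tokens, retries, context size."""
    n = len(calls)
    return {
        "avg_tokens_out": sum(c["tokens_out"] for c in calls) / n,
        "avg_retries": sum(c["retries"] for c in calls) / n,
        "avg_context_tokens": sum(c["tokens_in"] for c in calls) / n,
    }


def diff_shapes(before: list[dict], after: list[dict]) -> dict:
    """Ratio of each factor (after / before); the factor furthest
    above 1.0 is the prime suspect for the cost spike."""
    b, a = per_call_shape(before), per_call_shape(after)
    return {k: round(a[k] / b[k], 2) if b[k] else None for k in b}
```

If the bill jumped 40% on flat traffic, one of these ratios will usually be far from 1.0, and that tells you whether to look at verbose outputs, a retry loop, or a context window that quietly grew.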

Phase 4: Sketching Your 3am Dashboard

Sketch a 3am dashboard for an LLM feature you actually ship

1 drop
  1. Sketch the dashboard an on-call engineer reads at 3am

    12 min

    The capstone: combine the six signals from Phase 1 (prompt version, tokens, cost, latency, quality) into one dashboard layout an on-call engineer can read at a glance.
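One way to sketch that dashboard before touching any tool is as plain data: a list of panels, each naming the signal it charts and the 3am question it answers. The panel set below is an illustrative assumption built from the six signals earlier in the path, not a prescribed layout:

```python
# Illustrative panel list; the signals come from the path's six-signal set.
DASHBOARD = [
    {"panel": "Request volume and error rate", "signal": "latency/errors",
     "question": "Is the feature up at all?"},
    {"panel": "Time-to-first-token p50/p95", "signal": "TTFT",
     "question": "Does it feel fast to users?"},
    {"panel": "Output/input token ratio", "signal": "tokens in/out",
     "question": "Is a cost spike brewing?"},
    {"panel": "Cost per hour", "signal": "cost",
     "question": "Is spend tracking traffic?"},
    {"panel": "Quality score by prompt version",
     "signal": "eval score + prompt version",
     "question": "Did a prompt change regress quality?"},
]

for p in DASHBOARD:
    print(f"{p['panel']:<40} -> {p['question']}")
```

Writing the panels down as data first keeps the exercise tool-agnostic; the same list translates directly into Langfuse dashboards or Datadog widgets later.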

Frequently asked questions

What is LLM observability and how is it different from regular APM?
Traditional APM tracks latency and error rates, so an LLM feature can look green while it's broken: every request returns 200 OK while the outputs degrade. LLM observability keeps the logs-metrics-traces model and swaps the underlying signals to prompt, response, tokens, cost, latency, and eval score. Phase 1 of this path walks through the difference.
Why don't Datadog or New Relic give me what I need for LLM features?
Generic APM was never built to see silent quality regressions, prompt drift, or cost spikes: none of the three shows up on a standard dashboard. Phase 1 covers these failure modes, and Phase 3 shows how LLM-specific tools run alongside, not instead of, your existing APM.
What should I log on every LLM call beyond prompt and response?
Six signals: the prompt (with its version), the response, tokens in and out, cost, latency, and a quality signal such as a rule-based eval score. Capture these on every call and 80% of LLM observability falls into place; Phase 1 closes with a drop on exactly this list.
How is Langfuse different from LangSmith, Helicone, and Phoenix?
Each tool emphasizes a different pillar: Langfuse focuses on traces and evals, LangSmith on dev-loop debugging, Helicone on a logging proxy, and Phoenix on offline analysis. Phase 3 maps all four onto the same pillars so you can pick the one that covers the pillar you're weakest in.
How do I detect a silent quality regression after I change a prompt?
Version your prompts, join the version to your traces, and chart a quality signal (even a simple rule-based online eval) over time. When quality drops days after a prompt change, the versioned traces both catch the regression and explain it; Phases 2 and 3 cover the mechanics.