
🔭 Learn LLM Observability Fundamentals

Stop finding out about LLM regressions from angry user emails. By the end you'll know what to log on every call, which tools fit which signal, and how to sketch one dashboard an on-call engineer can read at 3am.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Generic APM Misses LLM-Shaped Problems

See why generic APM misses LLM-shaped problems

4 drops
  1. Your APM is green and your LLM feature is broken

    6 min

    Traditional APM tracks latency and errors; LLMs fail silently with bad outputs while every request returns 200 OK.

  2. Logs, metrics, traces, but the units are different

    6 min

    The three-pillars model still applies; you just swap the underlying signals from HTTP/SQL to prompt, response, tokens, cost, latency, and eval score.

  3. Three failure modes APM was never built to see

    7 min

    Silent quality regressions, prompt drift, and cost spikes are the three failure modes unique to LLM systems; none of them shows up on a generic APM dashboard.

  4. Six signals to log on every LLM call

    6 min

    Prompt (with version), response, tokens in/out, cost, latency, and a quality signal: capture these on every call and 80% of LLM observability falls into place.
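The six signals above fit in one structured log record per call. Here is a minimal sketch in Python; the field names and the `rough_quality_check` rule are illustrative assumptions, not any specific tool's schema:

```python
import hashlib
import json
import time


def rough_quality_check(response: str) -> float:
    """Hypothetical rule-based quality signal: 1.0 if the response
    is non-empty and within a sane length bound, else 0.0."""
    return 1.0 if 0 < len(response) <= 4000 else 0.0


def log_llm_call(prompt: str, prompt_version: str, response: str,
                 tokens_in: int, tokens_out: int,
                 cost_usd: float, latency_ms: float) -> dict:
    """Capture the six signals on every call: prompt (with version),
    response, tokens in/out, cost, latency, and a quality signal."""
    record = {
        "ts": time.time(),
        "prompt_version": prompt_version,
        # Hash instead of raw text if prompts may contain user data.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response": response,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": cost_usd,
        "latency_ms": latency_ms,
        "quality": rough_quality_check(response),
    }
    print(json.dumps(record))  # ship to your log pipeline instead
    return record
```

Emitting one such record per call is enough to chart cost, latency, and quality over time, and the prompt version field is what later lets you join regressions back to prompt changes.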

Phase 2: Wiring One Endpoint End to End

Wire prompt, tokens, cost, latency, and quality into one trace

5 drops
  1. A trace is a tree of spans, and your LLM call is a tree

    6 min

    OpenTelemetry's span-and-attribute model maps cleanly onto an LLM pipeline: each step (retrieve, rerank, model, tool, parse) is a span with structured attributes.

  2. Wrap one call site in a span and stop guessing

    7 min

    The minimum viable instrumentation is one span around the model call, with prompt hash, model name, and token counts as attributes; that single span unlocks 70% of debugging.

  3. Cost is a function of tokens: log both, track the ratio

    7 min

    Cost in dollars is derived; tokens in/out are the primary signal. Track the ratio of output to input tokens; it's the leading indicator of cost surprises.

  4. Time-to-first-token is the latency that users feel

    6 min

    For streaming UIs, time-to-first-token (TTFT) matters more than total latency; users tolerate long answers if the first word arrives fast.

  5. Online evals: the cheap version is a rule, not a model

    7 min

    An online eval doesn't have to be a judge LLM: a regex, a JSON-schema check, or a length-bound rule already gives you a quality signal you can chart over time.
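A rule-based online eval of the kind drop 5 describes can be a few lines of stdlib Python. In this sketch the required keys and length bound are illustrative assumptions; the function returns 1 (pass) or 0 (fail) so results can be averaged into a quality time series:

```python
import json


def online_eval(response: str, required_keys=("answer", "sources"),
                max_chars: int = 2000) -> int:
    """Rule-based online eval: no judge LLM needed.
    Combines a length-bound rule with a JSON-shape check."""
    if len(response) > max_chars:       # length-bound rule
        return 0
    try:
        parsed = json.loads(response)   # does it parse at all?
    except json.JSONDecodeError:
        return 0
    if not isinstance(parsed, dict):    # must be an object
        return 0
    return int(all(k in parsed for k in required_keys))
```

Run it on every response and chart the rolling pass rate; a dip in that line is a quality regression you would otherwise only hear about from users.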

Phase 3: Mapping the Tool Landscape onto the Pillars

Map Langfuse, LangSmith, Helicone, and Phoenix onto the same pillars

4 drops
  1. Your CTO asks: Langfuse, LangSmith, Helicone, or Phoenix?

    7 min

    Each tool emphasizes a different pillar: Langfuse on traces and evals, LangSmith on dev-loop debugging, Helicone on a logging proxy, Phoenix on offline analysis. Map them to the pillar you're weakest in.

  2. A prompt change shipped Monday; quality dropped Friday

    7 min

    Prompt drift is a versioning problem; the same observability stack that catches it also explains it once prompts are versioned and joined to traces.

  3. The bill jumped 40% but traffic is flat

    7 min

    When traffic is flat and cost spikes, the bug is in the per-call shape: usually output tokens, retry counts, or a quietly larger context window.

  4. Your stack already has Datadog. Where do LLM signals belong?

    7 min

    Most LLM-specific tools (Langfuse, LangSmith) run alongside, not instead of, your existing APM: they're the right home for prompts, evals, and per-call inspection; APM stays the home for cross-service traces and infra metrics.
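The flat-traffic cost spike from drop 3 can be isolated by comparing the per-call shape across two time windows. A sketch, assuming simple per-call records whose field names are invented here for illustration:

```python
def per_call_shape(calls: list[dict]) -> dict:
    """Average the per-call factors that drive cost when
    request volume is flat: output tokens, retries, context size."""
    n = len(calls)
    return {
        "avg_tokens_out": sum(c["tokens_out"] for c in calls) / n,
        "avg_retries": sum(c["retries"] for c in calls) / n,
        "avg_context_tokens": sum(c["tokens_in"] for c in calls) / n,
    }


def diff_shapes(before: list[dict], after: list[dict]) -> dict:
    """Ratio of each factor (after / before); the factor furthest
    above 1.0 is the prime suspect for the cost spike."""
    b, a = per_call_shape(before), per_call_shape(after)
    return {k: round(a[k] / b[k], 2) if b[k] else None for k in b}
```

If the bill jumped 40% on flat traffic, one of these ratios will usually be far from 1.0, and that tells you whether to look at verbose outputs, a retry loop, or a context window that quietly grew.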

Phase 4: Sketching Your 3am Dashboard

Sketch a 3am dashboard for an LLM feature you actually ship

1 drop
  1. Sketch the dashboard an on-call engineer reads at 3am

    12 min

    The capstone: combine the six signals from Phase 1 (prompt version, tokens, cost, latency, quality) into one dashboard layout an on-call engineer can read at a glance.
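One way to sketch that dashboard before touching any tool is as plain data: a list of panels, each naming the signal it charts and the 3am question it answers. The panel set below is an illustrative assumption built from the six signals earlier in the path, not a prescribed layout:

```python
# Illustrative panel list; the signals come from the path's six-signal set.
DASHBOARD = [
    {"panel": "Request volume and error rate", "signal": "latency/errors",
     "question": "Is the feature up at all?"},
    {"panel": "Time-to-first-token p50/p95", "signal": "TTFT",
     "question": "Does it feel fast to users?"},
    {"panel": "Output/input token ratio", "signal": "tokens in/out",
     "question": "Is a cost spike brewing?"},
    {"panel": "Cost per hour", "signal": "cost",
     "question": "Is spend tracking traffic?"},
    {"panel": "Quality score by prompt version",
     "signal": "eval score + prompt version",
     "question": "Did a prompt change regress quality?"},
]

for p in DASHBOARD:
    print(f"{p['panel']:<40} -> {p['question']}")
```

Writing the panels down as data first keeps the exercise tool-agnostic; the same list translates directly into Langfuse dashboards or Datadog widgets later.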

Frequently asked questions

What is LLM observability and how is it different from regular APM?
Traditional APM tracks latency and error rates, so an LLM feature can look green while it's broken: every request returns 200 OK while the outputs degrade. LLM observability keeps the logs-metrics-traces model and swaps the underlying signals to prompt, response, tokens, cost, latency, and eval score. Phase 1 of this path walks through the difference.
Why don't Datadog or New Relic give me what I need for LLM features?
Generic APM was never built to see silent quality regressions, prompt drift, or cost spikes: none of the three shows up on a standard dashboard. Phase 1 covers these failure modes, and Phase 3 shows how LLM-specific tools run alongside, not instead of, your existing APM.
What should I log on every LLM call beyond prompt and response?
Six signals: the prompt (with its version), the response, tokens in and out, cost, latency, and a quality signal such as a rule-based eval score. Capture these on every call and 80% of LLM observability falls into place; Phase 1 closes with a drop on exactly this list.
How is Langfuse different from LangSmith, Helicone, and Phoenix?
Each tool emphasizes a different pillar: Langfuse focuses on traces and evals, LangSmith on dev-loop debugging, Helicone on a logging proxy, and Phoenix on offline analysis. Phase 3 maps all four onto the same pillars so you can pick the one that covers the pillar you're weakest in.
How do I detect a silent quality regression after I change a prompt?
Version your prompts, join the version to your traces, and chart a quality signal (even a simple rule-based online eval) over time. When quality drops days after a prompt change, the versioned traces both catch the regression and explain it; Phases 2 and 3 cover the mechanics.