🔬Build an LLM Eval Harness for Production
Stop running eval notebooks once and forgetting them. Build a three-layer harness — pre-merge CI, pre-deploy gate, online sampling — with the right cadence, budget, and judge calibration for a production RAG app.
Phase 1: Why one-shot eval notebooks always rot
- The notebook eval is a snapshot, not a system (6 min)
- Every mature LLM team converges on three eval surfaces (7 min)
- Eval cost compounds — budget per layer or pay a surprise bill (7 min)
- Eval sets rot when the real distribution shifts (7 min)
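As a rough illustration of budgeting per layer, the sketch below multiplies run cadence by eval-set size by per-example cost for each of the three layers. Every number, the `Layer` class, and the `monthly_cost` method are hypothetical placeholders, not figures from the course; substitute your own cadence and pricing.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One eval layer with its cadence and per-example cost (all values assumed)."""
    name: str
    runs_per_month: int
    examples_per_run: int
    cost_per_example_usd: float

    def monthly_cost(self) -> float:
        # Cost grows with the product of cadence and eval-set size.
        return self.runs_per_month * self.examples_per_run * self.cost_per_example_usd


# Hypothetical numbers: a small, frequent CI set; a larger, rarer gate;
# and a daily sample of production traffic.
layers = [
    Layer("pre-merge CI", runs_per_month=400, examples_per_run=10, cost_per_example_usd=0.01),
    Layer("pre-deploy gate", runs_per_month=20, examples_per_run=500, cost_per_example_usd=0.02),
    Layer("online sampling", runs_per_month=30, examples_per_run=1000, cost_per_example_usd=0.02),
]

monthly_costs = {layer.name: layer.monthly_cost() for layer in layers}
total_monthly_cost = sum(monthly_costs.values())

for name, cost in monthly_costs.items():
    print(f"{name}: ${cost:,.2f} per month")
print(f"total: ${total_monthly_cost:,.2f} per month")
```

Under these made-up inputs the cheapest-looking layer per run is not the cheapest per month, which is why each layer gets its own budget line.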
Phase 2: Wire 10 golden examples into CI regression
- Pick 10 examples that span your real failure modes (7 min)
- Write the expected output before you see what the model says (7 min)
- Move the eval into the repo as code, not a colab (7 min)
- Wire the harness into CI with a regression threshold (8 min)
- Break the prompt on purpose to verify CI catches it (7 min)
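A minimal sketch of what a golden-example regression check with a threshold might look like, assuming a simple "answer must contain these phrases" scoring rule. The examples, `pass_rate`, `passes_gate`, and both stub models are hypothetical; a real harness would call your actual model and would likely use richer scoring.

```python
from typing import Callable, Dict, List

# Hypothetical golden examples: expected phrases are written before running the model.
GOLDEN_EXAMPLES: List[Dict] = [
    {"question": "How do I reset my password?", "must_contain": ["reset link"]},
    {"question": "What is the refund window?", "must_contain": ["30 days"]},
    {"question": "Can I export my data?", "must_contain": ["export", "csv"]},
]


def pass_rate(model_fn: Callable[[str], str], goldens: List[Dict]) -> float:
    """Fraction of golden examples whose answer contains every required phrase."""
    passed = 0
    for example in goldens:
        answer = model_fn(example["question"]).lower()
        if all(phrase.lower() in answer for phrase in example["must_contain"]):
            passed += 1
    return passed / len(goldens)


def passes_gate(current: float, baseline: float, max_drop: float = 0.0) -> bool:
    """True when the current pass rate has not dropped more than max_drop below baseline."""
    return current >= baseline - max_drop


def good_model(question: str) -> str:
    # Stub standing in for the real prompt and model call.
    answers = {
        "How do I reset my password?": "We will email you a reset link.",
        "What is the refund window?": "Refunds are accepted within 30 days of purchase.",
        "Can I export my data?": "Yes, you can export your data as a CSV file.",
    }
    return answers[question]


def broken_model(question: str) -> str:
    # Stub simulating a deliberately broken prompt.
    return "Please contact support."


baseline_rate = pass_rate(good_model, GOLDEN_EXAMPLES)
broken_rate = pass_rate(broken_model, GOLDEN_EXAMPLES)
gate_ok_for_good = passes_gate(baseline_rate, baseline_rate)
gate_ok_for_broken = passes_gate(broken_rate, baseline_rate)
print(f"good: {baseline_rate:.2f} gate={gate_ok_for_good}")
print(f"broken: {broken_rate:.2f} gate={gate_ok_for_broken}")
```

In CI, the job would exit non-zero when `passes_gate` returns `False`. Running the check against a deliberately broken stub, as here, is one way to confirm the gate can fail at all.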
Phase 3: LLM-as-judge: when to trust, when to verify
- Your teammate proposes 'just use GPT-4 to grade everything' (7 min)
- The judge prefers verbose, formal answers — and you didn't notice (8 min)
- The aggregate went up — but adversarial items quietly failed (8 min)
- The judge is right 90% of the time — is that enough? (8 min)
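One way to probe whether a 90% judge is good enough is to compare judge verdicts against human labels per slice, not only in aggregate. The sketch below does that on a tiny made-up dataset; the records, the slice names, and `agreement_by_slice` are illustrative assumptions, not data or code from the course.

```python
from collections import defaultdict
from typing import Dict, List


def agreement_by_slice(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Judge-vs-human agreement rate, overall and for each slice."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        agreed = int(record["judge"] == record["human"])
        # Count every record once overall and once under its own slice.
        for key in ("overall", record["slice"]):
            totals[key] += 1
            hits[key] += agreed
    return {key: hits[key] / totals[key] for key in totals}


# Made-up labels: the judge agrees with humans on all typical items
# but on only one of the two adversarial items.
records = (
    [{"slice": "typical", "judge": "pass", "human": "pass"}] * 8
    + [
        {"slice": "adversarial", "judge": "fail", "human": "fail"},
        {"slice": "adversarial", "judge": "pass", "human": "fail"},
    ]
)

agreement = agreement_by_slice(records)
for slice_name, rate in agreement.items():
    print(f"{slice_name}: {rate:.0%}")
```

With these toy labels the overall agreement is 90% while the adversarial slice sits at 50%, the kind of gap an aggregate score can hide.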
Phase 4: Design a three-layer harness for RAG
- Design a three-layer eval harness for a RAG support bot (10 min)
Frequently asked questions
- What's the difference between pre-merge CI evals, pre-deploy gates, and online sampling?
- How many golden examples do you need before a CI eval is useful?
- When is LLM-as-judge reliable enough to trust unsupervised?
- How do you budget eval cost across three layers without blowing the bill?
- Why do one-off eval notebooks rot, and what replaces them?

Each of these questions is covered in the “Build an LLM Eval Harness for Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.