Version and A/B Test Prompts in Production
Stop shipping prompt edits like config tweaks and start treating them like code with versions, canaries, and kill switches. By the end you can write a one-page rollout plan with success criteria, sample size, and a rollback trigger that someone else could execute.
Phase 1: Why Prompts Are Software, Not Config
Treat prompts as software with identifiers and diffs
Your prompt is software, even when it looks like a string
(6 min) The moment a prompt shapes user-facing behavior, it needs the same primitives any production code gets: an identifier, a diff, an audit trail, and a way back.
The three concerns you keep tangling together
(6 min) Versioning, environment promotion, and A/B testing are three independent problems; confusing them is why most prompt rollouts feel chaotic.
A prompt identifier is more than a string name
(7 min) A real prompt identifier locks together the template, the model, and the sampling parameters; change any one and you have a new version.
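One way to make that lock concrete is to derive the identifier from a hash of everything that affects output. The sketch below is a minimal illustration, not a prescribed scheme: the class name `PromptVersion`, the field set, the 12-character hash prefix, and the model string `example-model-1` are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptVersion:
    """Bundles the template, model, and sampling parameters under one ID."""
    name: str           # hypothetical logical prompt name
    template: str
    model: str          # placeholder string, not a real model name
    temperature: float
    max_tokens: int

    @property
    def version_id(self) -> str:
        # Canonical JSON (sorted keys) so identical content always hashes the same.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return f"{self.name}@{hashlib.sha256(payload).hexdigest()[:12]}"


v1 = PromptVersion("support-reply", "Answer politely: {question}", "example-model-1", 0.2, 256)
# Same template and model, different temperature: this is a new version.
v2 = PromptVersion("support-reply", "Answer politely: {question}", "example-model-1", 0.7, 256)
print(v1.version_id != v2.version_id)
```

Because the ID is derived from content, nobody can change the temperature "in place" without the identifier changing too.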
Rollback is a feature, not an emergency
(6 min) If rolling back a prompt requires a deploy, you don't have rollback; you have a slow recovery procedure.
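A deploy-free rollback can be sketched as an append-only set of versions plus a movable "active" pointer. The `PromptRegistry` class below is a hypothetical in-memory illustration; in a real system the pointer would presumably live in a store that is read at request time (a database row or a feature-flag service), which is an assumption beyond what this outline specifies.

```python
class PromptRegistry:
    """Append-only versions plus a movable 'active' pointer (in-memory sketch)."""

    def __init__(self) -> None:
        self._versions: dict[str, str] = {}  # version_id -> template
        self._history: list[str] = []        # activation order, newest last

    def register(self, version_id: str, template: str) -> None:
        # Versions are never edited in place; a change means a new ID.
        if version_id in self._versions:
            raise ValueError(f"{version_id} already exists; versions are immutable")
        self._versions[version_id] = template

    def activate(self, version_id: str) -> None:
        if version_id not in self._versions:
            raise KeyError(version_id)
        self._history.append(version_id)

    def rollback(self) -> str:
        """Re-point to the previously active version; no redeploy involved."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def active(self) -> str:
        return self._history[-1]
```

Rolling back here is a pointer move on data that already exists, which is what separates it from a recovery procedure that has to rebuild and ship something.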
Phase 2: Tagging Versions and Canarying Traffic
Tag two versions and canary 10% of traffic
Cut your first real prompt version
(7 min) Versioning starts the moment you stop editing the prompt in place and start appending new versions next to it.
Log the version on every single request
(6 min) If you don't log the version ID on each LLM call, the version table is just decoration: you can't tie outcomes back to versions.
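A minimal sketch of per-request version logging, using only the standard library. The wrapper name `call_with_version_log`, the log field names, and the `llm_call` parameter (a stand-in for whatever client function you actually use) are assumptions for illustration.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_calls")


def call_with_version_log(version_id: str, prompt: str, llm_call) -> str:
    """Run one LLM call and emit a structured log line carrying the version ID."""
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    output = llm_call(prompt)  # llm_call is a placeholder for your real client
    logger.info(json.dumps({
        "request_id": request_id,
        "prompt_version": version_id,
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        "output_chars": len(output),
    }))
    return output
```

With `prompt_version` on every line, any downstream outcome that carries the same `request_id` can be joined back to the exact version that produced it.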
Route 10% of traffic to the new version
(7 min) A canary is just a routing rule plus the same versioning and logging you already have.
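The routing rule can be as small as a deterministic hash bucket. The sketch below assumes a stable `user_id` is available and uses a hypothetical `experiment` string as a salt; the function name and the 10,000-bucket granularity are illustrative choices.

```python
import hashlib


def assign_version(user_id: str, experiment: str, canary_pct: float,
                   control: str, canary: str) -> str:
    """Deterministically map a user to the control or canary version.

    canary_pct is a percentage, e.g. 10 for 10% of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return canary if bucket < canary_pct * 100 else control


# The same user always lands on the same side for a given experiment.
print(assign_version("user-42", "reply-tone", 10, "v1", "v2"))
```

Hashing instead of calling a random number generator keeps assignment stable across requests and servers, so a user does not flip between versions mid-session.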
Pick the quality signal before you ship the canary
(7 min) If you can't name the metric that would make you roll back, you don't have an experiment; you have a vibe check.
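Naming the metric ahead of time can be as simple as writing the trigger down as data before the canary starts. The `RollbackTrigger` class, the metric name `thumbs_up_rate`, and every number below are hypothetical placeholders, not recommended thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackTrigger:
    """A rollback rule agreed on before the canary ships."""
    metric: str               # which quality signal is being watched
    baseline: float           # current rate on the control version
    max_relative_drop: float  # e.g. 0.05 means a 5% relative drop is tolerated
    min_samples: int          # do not act on fewer observations than this

    def should_roll_back(self, observed: float, samples: int) -> bool:
        if samples < self.min_samples:
            return False  # too little data to act on either way
        return observed < self.baseline * (1 - self.max_relative_drop)


# Illustrative numbers only.
trigger = RollbackTrigger("thumbs_up_rate", baseline=0.80,
                          max_relative_drop=0.05, min_samples=500)
print(trigger.should_roll_back(observed=0.70, samples=1000))
```

A fixed threshold like this is a blunt instrument; it says nothing about statistical noise, which is a separate question from whether the rule was written down in advance.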
Run the canary end-to-end on a toy prompt
(7 min) Doing the whole loop on something low-stakes first reveals every gap your real rollout will hit, but on a Tuesday instead of during an incident.
Phase 3: Reading Signal Like an Experimenter
Read significance, stable assignment, and trial discipline
The canary looks worse. Should you roll back?
(8 min) Most apparent quality differences in the first hours of a canary are noise; the discipline is in deciding what threshold tells you it's signal.
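One common way to separate noise from signal on a binary quality metric is a two-proportion z-test. This is a swapped-in standard technique, not necessarily the method the lesson teaches, and the counts in the example are made up for illustration.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates.

    Assumes both arms have at least one success and one failure overall,
    otherwise the pooled standard error is zero.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Made-up counts: control 820/1000 good responses, canary 75/100.
z, p = two_proportion_z_test(820, 1000, 75, 100)
print(f"z={z:.2f}, p={p:.3f}")
```

In this made-up case the canary's rate is seven points lower, yet with only 100 canary requests the p-value stays above 0.05; the same rates at ten times the traffic would be far below it.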
How many users do you actually need?
(8 min) Sample size depends on three things you can name today: baseline rate, minimum detectable effect, and acceptable false-positive rate.
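Those inputs plug into the standard sample-size formula for comparing two proportions. The sketch below adds one assumption the sentence above does not name, statistical power (defaulted to 0.80 here), and treats the minimum detectable effect as an absolute change in the rate; the function name and defaults are illustrative.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in EACH arm for a two-sided two-proportion test.

    baseline: current success rate, e.g. 0.80
    mde: absolute change to detect, e.g. -0.05 for a five-point drop
    alpha: acceptable false-positive rate
    power: chance of detecting a true effect of size mde (an added assumption)
    """
    p1, p2 = baseline, baseline + mde
    if mde == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1, and mde must be nonzero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / mde ** 2)


print(sample_size_per_arm(0.80, -0.05))
```

The required size grows roughly with the inverse square of the effect, so halving the drop you want to catch roughly quadruples the users needed per arm.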
Stop running twelve A/Bs at once
(8 min) Concurrent prompt experiments interact; running too many at once means you can't attribute outcomes to any single change.
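One possible mitigation, offered here as an assumption rather than as the lesson's own recommendation, is to put experiments in a mutually exclusive layer so each user is exposed to exactly one of them. The function name `pick_experiment`, the salt value, and the experiment names are hypothetical.

```python
import hashlib


def pick_experiment(user_id: str, experiments: list[str],
                    salt: str = "layer-1") -> str:
    """Place each user in exactly one experiment from a mutually exclusive layer."""
    if not experiments:
        raise ValueError("at least one experiment is required")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return experiments[int.from_bytes(digest[:8], "big") % len(experiments)]


layer = ["reply-tone", "summary-length", "citation-style"]
print(pick_experiment("user-42", layer))
```

The trade-off is traffic: each experiment in the layer sees only a fraction of users, so every one of them takes longer to reach its required sample size. Note also that changing the list reshuffles assignments, so the layer should stay fixed while experiments are running.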
The model changed under you. Now what?
(8 min) Provider model updates can shift output behavior overnight, mimicking a prompt regression that isn't yours; the version-pin discipline is what makes the difference visible.
Phase 4: Writing the Rollout Plan Someone Else Could Run
Plan a real rollout with kill switch and sample size
Write the prompt rollout plan you'd hand to a teammate
(25 min)
Frequently asked questions
- What does it mean to version a prompt and how is it different from a git commit?
- This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- How do you A/B test a prompt without running 12 experiments at once?
- This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- What sample size do I need to detect a prompt regression?
- This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- How do I tell if a quality drop came from the prompt, the model, or sampling parameters?
- This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- What goes in a prompt rollout kill switch?
- This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Related paths
Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking, then ship a working caching or logging decorator from scratch in under 30 lines.
Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic, one failing snippet at a time, until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
Big O Intuition
Stop treating Big O as math you memorized for an interview: build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.