
🚦Version and A/B Test Prompts in Production

Stop shipping prompt edits like config tweaks and start treating them like code with versions, canaries, and kill switches. By the end you can write a one-page rollout plan with success criteria, sample size, and a rollback trigger that someone else could execute.

Advanced · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Prompts Are Software, Not Config

Treat prompts as software with identifiers and diffs

4 drops
  1. Your prompt is software, even when it looks like a string

    6 min

    The moment a prompt shapes user-facing behavior, it needs the same primitives any production code gets: an identifier, a diff, an audit trail, and a way back.

  2. The three concerns you keep tangling together

    6 min

    Versioning, environment promotion, and A/B testing are three independent problems; confusing them is why most prompt rollouts feel chaotic.

  3. A prompt identifier is more than a string name

    7 min

    A real prompt identifier locks together the template, the model, and the sampling parameters; change any one and you have a new version.

  4. Rollback is a feature, not an emergency

    6 min

    If rolling back a prompt requires a deploy, you don't have rollback; you have a slow recovery procedure.
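
The ideas in this phase can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `PromptVersion`, `PromptRegistry`, and the model name `example-model-1` are hypothetical names chosen for the example. The version ID is derived from a hash of the template, model, and sampling parameters, so changing any one of them yields a new ID, and rollback is a pointer move rather than a deploy.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptVersion:
    """Hypothetical record: everything that shapes model output lives here."""
    name: str
    template: str
    model: str
    temperature: float
    top_p: float

    @property
    def version_id(self) -> str:
        # Hash all fields, so editing the template, model, or sampling
        # parameters always produces a different identifier.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return f"{self.name}@{hashlib.sha256(payload).hexdigest()[:12]}"


class PromptRegistry:
    """Hypothetical append-only store with an 'active' pointer."""

    def __init__(self) -> None:
        self._versions: dict[str, PromptVersion] = {}
        self._history: list[str] = []  # activated version IDs, newest last

    def publish(self, version: PromptVersion) -> str:
        # Versions are appended next to each other, never edited in place.
        self._versions[version.version_id] = version
        return version.version_id

    def activate(self, version_id: str) -> None:
        if version_id not in self._versions:
            raise KeyError(f"unknown prompt version: {version_id}")
        self._history.append(version_id)

    def rollback(self) -> str:
        # Rollback only moves the pointer back; no code is redeployed.
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def active(self) -> PromptVersion:
        return self._versions[self._history[-1]]
```

In a real system the registry would live in a database or a configuration service so that the pointer can change at runtime; the in-memory dictionary here only shows the shape of the data.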

Phase 2: Tagging Versions and Canarying Traffic

Tag two versions and canary 10% of traffic

5 drops
  1. Cut your first real prompt version

    7 min

    Versioning starts the moment you stop editing the prompt in place and start appending new versions next to it.

  2. Log the version on every single request

    6 min

    If you don't log the version ID on each LLM call, the version table is just decoration: you can't tie outcomes back to versions.

  3. Route 10% of traffic to the new version

    7 min

    A canary is just a routing rule plus the same versioning and logging you already have.

  4. Pick the quality signal before you ship the canary

    7 min

    If you can't name the metric that would make you roll back, you don't have an experiment; you have a vibe check.

  5. Run the canary end-to-end on a toy prompt

    7 min

    Doing the whole loop on something low-stakes first reveals every gap your real rollout will hit, but on a Tuesday instead of during an incident.
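
A canary router plus per-request logging might look like the sketch below. It assumes a string user ID and an experiment salt; the function names (`bucket`, `choose_version`, `log_record`) and the log field names are hypothetical. Hashing the user ID, rather than drawing a random number per request, keeps each user on the same version for the whole experiment.

```python
import hashlib
import json
import time


def bucket(user_id: str, salt: str) -> float:
    """Map a user to a stable number in [0, 1) for a given experiment salt."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # The first 4 bytes give an integer in [0, 2**32), so the result is < 1.
    return int.from_bytes(digest[:4], "big") / 2**32


def choose_version(user_id: str, stable: str, canary: str,
                   canary_fraction: float = 0.10,
                   salt: str = "canary-1") -> str:
    """Send roughly `canary_fraction` of users to the canary version."""
    return canary if bucket(user_id, salt) < canary_fraction else stable


def log_record(user_id: str, version_id: str, outcome: str) -> str:
    """Build one JSON log line that ties an outcome to a prompt version."""
    return json.dumps({
        "ts": time.time(),
        "user_id": user_id,
        "prompt_version": version_id,
        "outcome": outcome,
    })
```

Changing the salt reshuffles which users land in the canary, which is one way to avoid reusing the same 10% of users for every experiment.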

Phase 3: Reading Signal Like an Experimenter

Read significance, stable assignment, and trial discipline

4 drops
  1. The canary looks worse: should you roll back?

    8 min

    Most apparent quality differences in the first hours of a canary are noise; the discipline is in deciding what threshold tells you it's signal.

  2. How many users do you actually need?

    8 min

    Sample size depends on three things you can name today: baseline rate, minimum detectable effect, and acceptable false-positive rate.

  3. Stop running twelve A/Bs at once

    8 min

    Concurrent prompt experiments interact; running too many at once means you can't attribute outcomes to any single change.

  4. The model changed under you. Now what?

    8 min

    Provider model updates can shift output behavior overnight, mimicking a prompt regression that isn't yours; pinning the model version is what makes the difference visible.
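
The sample-size question above can be made concrete with the standard two-proportion approximation. The sketch below is one common formula, not necessarily the one the lessons use. It takes the three inputs named above (baseline rate, minimum detectable effect, false-positive rate) plus statistical power, which is an added assumption here and defaults to 0.80. The function name `sample_size_per_arm` is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline: float, mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per arm for a two-sided two-proportion test.

    baseline: success rate of the current prompt, e.g. 0.10
    mde:      absolute change you want to detect, e.g. 0.02 (may be negative)
    alpha:    acceptable false-positive rate
    power:    chance of detecting a true effect of size `mde` (assumption)
    """
    p1, p2 = baseline, baseline + mde
    if mde == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie in (0, 1) and mde must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2

    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

Because the required sample grows roughly with the inverse square of the effect size, halving the minimum detectable effect roughly quadruples the number of users per arm, which is why the effect size is worth fixing before the canary starts.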

Phase 4: Writing the Rollout Plan Someone Else Could Run

Plan a real rollout with kill switch and sample size

1 drop
  1. Write the prompt rollout plan you'd hand to a teammate

    25 min

    Draft a one-page rollout plan with success criteria, sample size, and a rollback trigger that someone else could execute.
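
One way to check that a plan is executable by someone else is to write it as data with an explicit trigger. The sketch below is illustrative only: `RolloutPlan`, its fields, and `should_roll_back` are hypothetical names, and the trigger is a simple relative-drop threshold rather than a significance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutPlan:
    """Hypothetical one-page plan expressed as data."""
    stable_version: str
    canary_version: str
    canary_fraction: float     # share of traffic sent to the canary
    success_metric: str        # the metric named before shipping
    min_samples_per_arm: int   # from the sample-size calculation
    max_relative_drop: float   # rollback trigger, e.g. 0.05 for a 5% drop


def should_roll_back(plan: RolloutPlan, stable_rate: float,
                     canary_rate: float, canary_samples: int) -> bool:
    """Return True when the canary has enough data and is clearly worse."""
    if canary_samples < plan.min_samples_per_arm:
        return False  # too early: differences at this point are mostly noise
    if stable_rate <= 0:
        return False  # no meaningful baseline to compare against
    relative_drop = (stable_rate - canary_rate) / stable_rate
    return relative_drop > plan.max_relative_drop
```

A real plan would likely add a second, looser trigger that fires immediately on severe failures such as error spikes, since waiting for the full sample is only appropriate for subtle quality differences.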

Frequently asked questions

What does it mean to version a prompt and how is it different from a git commit?
This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do you A/B test a prompt without running 12 experiments at once?
This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What sample size do I need to detect a prompt regression?
This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do I tell if a quality drop came from the prompt, the model, or sampling parameters?
This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What goes in a prompt rollout kill switch?
This is covered in the “Version and A/B Test Prompts in Production” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.