🔊 Understand Text-to-Speech Quality Dimensions
Build a five-axis TTS scorecard — naturalness, prosody, latency, consistency, controllability — that replaces demo-vibe-checks with a defensible audit you can take into any voice-agent vendor meeting.
Phase 1: Seeing Past MOS to What Actually Breaks Voice Products
See past MOS to what actually breaks voice products
MOS is a vibe-check dressed up as a number
6 min · Mean Opinion Score collapses everything into one rating from a panel of listeners, which is exactly why a 4.3 model can still feel wrong in production.
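To make the point concrete, here is a minimal sketch of how two listener panels can land on the same MOS while telling very different stories. The ratings are invented illustration data, not measurements of any vendor, and `mos_summary` is a hypothetical helper name.

```python
from statistics import mean, pstdev

def mos_summary(ratings):
    """Return (MOS, population std dev, share of ratings at 2 or below)."""
    mos = mean(ratings)
    spread = pstdev(ratings)
    low_share = sum(1 for r in ratings if r <= 2) / len(ratings)
    return round(mos, 2), round(spread, 2), round(low_share, 2)

# Hypothetical 1-5 ratings from two ten-listener panels (made-up data).
steady = [4, 4, 4, 5, 4, 5, 4, 4, 5, 4]       # everyone finds it fine
polarizing = [5, 5, 5, 5, 5, 5, 5, 5, 2, 1]   # most love it, two find it broken

for name, ratings in (("steady", steady), ("polarizing", polarizing)):
    mos, spread, low_share = mos_summary(ratings)
    print(f"{name}: MOS={mos} spread={spread} low-rating share={low_share}")
```

Both panels average 4.3, but only the second contains listeners who heard something badly wrong. Reporting the spread and the low-rating share next to the mean is one inexpensive way to keep that signal visible.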
Prosody is the axis that makes TTS sound human or robotic
7 min · Prosody (rhythm, stress, intonation) is where modern TTS still cracks. Phoneme accuracy is largely solved; making sentences feel emotionally appropriate is not.
Latency is a product metric, not a benchmark
6 min · For voice agents, time-to-first-byte (TTFB) is the TTS latency number that matters most. Total synthesis time matters far less when the user hears speech in chunks as they arrive.
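A minimal sketch of measuring time-to-first-byte separately from total synthesis time, assuming the TTS response can be consumed as an iterable of byte chunks. `fake_tts_stream` is a hypothetical stand-in with invented delays, not any vendor's client; in a real test you would pass the vendor's streaming response instead.

```python
import time

def measure_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of audio chunks; return (ttfb_s, total_s, n_bytes)."""
    start = clock()
    ttfb = None
    n_bytes = 0
    for chunk in chunks:
        if ttfb is None and chunk:
            # First non-empty chunk: this is when playback could begin.
            ttfb = clock() - start
        n_bytes += len(chunk)
    total = clock() - start
    return ttfb, total, n_bytes

def fake_tts_stream(first_delay=0.05, gap=0.01, n_chunks=5):
    """Hypothetical stand-in for a streaming TTS response (invented timings)."""
    time.sleep(first_delay)          # model warm-up before the first audio
    for _ in range(n_chunks):
        yield b"\x00" * 320          # placeholder audio bytes
        time.sleep(gap)              # time spent producing the next chunk

ttfb, total, n_bytes = measure_stream(fake_tts_stream())
print(f"TTFB={ttfb * 1000:.0f} ms, total={total * 1000:.0f} ms, bytes={n_bytes}")
```

Measured this way, the timer includes network and client overhead, which is usually what the user experiences. Run it from the region your agent will be deployed in rather than from a laptop.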
Consistency and controllability are the hidden axes
7 min · Voice drift across utterances and lack of pronunciation control are the failures that show up at scale, not in demos. Most demos hide them by being short and curated.
Phase 2: Scoring Three Vendors on the Five-Axis Scorecard
Score ElevenLabs, OpenAI, and Cartesia on five axes
Build the five-axis scorecard you'll defend in every meeting
7 min · Naturalness, prosody, latency, consistency, controllability. Five axes, weighted by use case, scored 1-5 from your own ears and tests, not marketing pages.
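The scorecard arithmetic can be sketched as a weighted mean over the five axes. The weights and the two vendor score sets below are invented for illustration (a latency-heavy weighting such as a voice agent might use); they are not ratings of any real product.

```python
AXES = ("naturalness", "prosody", "latency", "consistency", "controllability")

def weighted_score(scores, weights):
    """Weighted mean of 1-5 axis scores; weights need not sum to 1."""
    for axis in AXES:
        if not 1 <= scores[axis] <= 5:
            raise ValueError(f"{axis} score must be between 1 and 5")
    total_weight = sum(weights[axis] for axis in AXES)
    return sum(scores[axis] * weights[axis] for axis in AXES) / total_weight

# Hypothetical weighting for a real-time voice agent: latency dominates.
voice_agent_weights = {
    "naturalness": 2, "prosody": 2, "latency": 4,
    "consistency": 1, "controllability": 1,
}

# Made-up scores for two unnamed vendors.
vendor_a = {"naturalness": 5, "prosody": 4, "latency": 2,
            "consistency": 4, "controllability": 3}
vendor_b = {"naturalness": 3, "prosody": 3, "latency": 5,
            "consistency": 4, "controllability": 3}

for name, scores in (("vendor_a", vendor_a), ("vendor_b", vendor_b)):
    print(name, round(weighted_score(scores, voice_agent_weights), 2))
```

With these weights the less natural but faster vendor scores higher. Swapping in a narration-style weighting would reverse the order, which is why the weights should be written down before any listening starts.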
Score ElevenLabs — the naturalness-first premium option
7 min · ElevenLabs optimizes for 'sounds most human' and 'biggest voice library.' That's the pitch and the limitation in one sentence.
Score OpenAI tts-1 — the developer-friendly default
7 min · OpenAI's TTS is the SQLite of TTS: nobody's first choice for premium voice, everyone's first choice when an OpenAI API key is already in the stack.
Score Cartesia Sonic — the latency-obsessed voice-agent option
7 min · Cartesia is what happens when an infra team builds TTS for voice agents instead of narration. The TTFB numbers show it.
Run the blind test that makes the scorecard real
8 min · Naming a winner from marketing pages is vendor capture in disguise. A blind A/B test on your own sentences is the only score that survives a design-review challenge.
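A minimal sketch of the bookkeeping behind a blind A/B test, assuming you have already synthesized each sentence with two vendors. The function names are hypothetical; the idea is only that the A/B assignment is randomized per sentence and the listener never sees which vendor sits behind which label.

```python
import random
from collections import Counter

def make_ab_trials(sentences, vendor_x, vendor_y, seed=0):
    """Build one trial per sentence with the vendors randomly assigned to A and B."""
    rng = random.Random(seed)
    trials = []
    for sentence in sentences:
        order = [vendor_x, vendor_y]
        rng.shuffle(order)                     # hide which vendor is "A"
        trials.append({"sentence": sentence, "A": order[0], "B": order[1]})
    rng.shuffle(trials)                        # hide any sentence ordering cue
    return trials

def tally(trials, choices):
    """choices[i] is 'A' or 'B' for trials[i]; returns wins per vendor."""
    wins = Counter()
    for trial, choice in zip(trials, choices):
        wins[trial[choice]] += 1
    return wins

sentences = ["Your order ships Tuesday.", "I'm sorry, could you repeat that?",
             "The total is $48.20.", "Let me transfer you now."]
trials = make_ab_trials(sentences, "vendor_x", "vendor_y")

# The listener sees only the sentence plus clips labeled A and B.
listener_choices = ["A", "B", "B", "A"]        # placeholder answers
print(tally(trials, listener_choices))
```

Keep the `trials` mapping sealed until every choice is recorded, and use sentences from your own product rather than the vendors' demo text.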
Phase 3: Reading the Failure Modes That Force a Switch
Spot when latency, drift, or prosody force a switch
Our voice agent felt laggy and the TTS wasn't the obvious culprit
8 min · When a voice agent feels slow, the instinct is to blame TTS, but TTFB is one of four contributors and rarely the largest. Audit the whole pipeline before swapping vendors.
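A pipeline audit can start as a simple latency budget. This outline does not name the four contributors, so the stage names and millisecond figures below are placeholders chosen for illustration only; replace them with timings measured in your own stack.

```python
def latency_breakdown(stage_ms):
    """Return (total_ms, stages sorted by share of total, largest first)."""
    total = sum(stage_ms.values())
    shares = {name: ms / total for name, ms in stage_ms.items()}
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return total, ranked

# Placeholder stages and timings (assumptions, not measurements).
measured_ms = {
    "endpointing": 400,
    "speech_to_text": 150,
    "llm_first_token": 450,
    "tts_ttfb": 200,
}

total, ranked = latency_breakdown(measured_ms)
print(f"total: {total} ms")
for name, share in ranked:
    print(f"  {name}: {share:.0%}")
```

If the TTS stage is a small share of the total, switching TTS vendors cannot recover most of the delay, whatever the vendor's benchmark says.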
The voice clone drifted on minute eight — and it wasn't random
8 min · Voice consistency at scale is a function of model stability parameters and prompt structure, not vendor quality alone. Drift is fixable before it's a migration.
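Before tuning anything, it helps to confirm that drift is real and where it starts. One possible approach, sketched here as an assumption rather than the path's own method, is to compare a speaker-embedding vector for each utterance against the first one. The embeddings are taken as given (from whatever speaker-embedding model you trust), and the tiny vectors and the 0.85 threshold are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def drift_report(embeddings, threshold=0.85):
    """Compare every utterance to the first; return (similarities, flagged indices)."""
    reference = embeddings[0]
    sims = [cosine(reference, e) for e in embeddings]
    flagged = [i for i, s in enumerate(sims) if s < threshold]
    return sims, flagged

# Toy 2-D vectors standing in for real speaker embeddings (made-up data).
utterance_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]

sims, flagged = drift_report(utterance_embeddings)
print([round(s, 3) for s in sims], "flagged:", flagged)
```

Plotting the similarities against utterance index shows whether drift is gradual or starts at a specific point, which is the evidence you need before changing stability settings or prompt structure.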
We hit 100K daily voice minutes and the bill became a board question
8 min · Cost trajectory is a year-2 problem you create in week 1. The migration trigger is rarely quality; it's the bill at scale.
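The arithmetic behind the "board question" is easy to sketch. The volume comes from the lesson title; the per-minute rates are hypothetical placeholders, not any vendor's pricing, and the flat-rate model ignores tiers and committed-use discounts.

```python
def monthly_cost(daily_minutes, price_per_minute, days_per_month=30):
    """Monthly spend in dollars for a flat per-minute rate."""
    return daily_minutes * price_per_minute * days_per_month

DAILY_MINUTES = 100_000  # the volume named in the lesson title

# Hypothetical per-minute rates in dollars (assumptions for illustration).
hypothetical_rates = {"rate_low": 0.01, "rate_mid": 0.05, "rate_high": 0.15}

for name, rate in hypothetical_rates.items():
    per_month = monthly_cost(DAILY_MINUTES, rate)
    print(f"{name}: ${per_month:,.0f}/month, ${per_month * 12:,.0f}/year")
```

Running the same projection in week 1 with your expected year-2 volume turns the pricing page into a number you can compare across vendors.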
Voice clones don't migrate — and that quietly changes everything
8 min · Migrating between TTS providers is mostly painless. Migrating cloned voices means re-recording or re-cloning every brand voice. The lock-in you actually have is to your voices, not your vendor.
Phase 4: Defending the Voice-Agent TTS Choice
Pick a TTS for a voice agent and defend it
Write the decision memo for a voice-agent TTS pick
25 min · The deliverable that proves you understand TTS selection is a one-page memo that survives a design-review cross-examination.
Frequently asked questions
- What is MOS and is it enough to judge a TTS vendor?
- This is covered in the “Understand Text-to-Speech Quality Dimensions” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- Why does my TTS sound great in the demo but off in production?
- How do you measure TTS latency for a real-time voice agent?
- What's the difference between streaming TTS and narration TTS?
- How do you keep a TTS voice consistent across long-form content?