
🔊 Understand Text-to-Speech Quality Dimensions

Build a five-axis TTS scorecard — naturalness, prosody, latency, consistency, controllability — that replaces demo-vibe-checks with a defensible audit you can take into any voice-agent vendor meeting.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Seeing Past MOS to What Actually Breaks Voice Products

See past MOS to what actually breaks voice products

4 drops
  1. MOS is a vibe-check dressed up as a number

    6 min

    Mean Opinion Score collapses everything into one rating from a panel of listeners — which is exactly why a 4.3 model can still feel wrong in production.

  2. Prosody is the axis that makes TTS sound human or robotic

    7 min

    Prosody (rhythm, stress, intonation) is where modern TTS still cracks. Phoneme accuracy is largely solved; making sentences feel emotionally appropriate is not.

  3. Latency is a product metric, not a benchmark

    6 min

    For voice agents, time-to-first-byte (TTFB) is the TTS latency number that matters most. Total synthesis time matters far less when the user hears speech in streamed chunks.

  4. Consistency and controllability are the hidden axes

    7 min

    Voice drift across utterances and lack of pronunciation control are the failures that show up at scale, not in demos. Most demos hide them by being short and curated.
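
The latency drop in this phase turns on the gap between time-to-first-byte and total synthesis time. A minimal sketch of timing both on a streamed response follows. It assumes the vendor client exposes audio as an iterable of byte chunks; `measure_stream` and `fake_tts_stream` are hypothetical names, and the fake stream only simulates delays instead of calling any real API.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[bytes],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a streamed synthesis response.

    Returns seconds until the first non-empty chunk (TTFB), seconds until
    the stream ends, and the total number of bytes received.
    """
    start = clock()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and ttfb is None:
            ttfb = clock() - start  # first audible data has arrived
        total_bytes += len(chunk)
    return {"ttfb_s": ttfb, "total_s": clock() - start, "bytes": total_bytes}


def fake_tts_stream(first_delay_s: float, chunk_delay_s: float,
                    n_chunks: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a vendor's streaming response."""
    time.sleep(first_delay_s)       # simulated model warm-up before first audio
    for _ in range(n_chunks):
        yield b"\x00" * 320         # placeholder audio payload
        time.sleep(chunk_delay_s)   # simulated gap between chunks


if __name__ == "__main__":
    print(measure_stream(fake_tts_stream(0.05, 0.01, 5)))
```

With a real client, the request has to be issued inside the timed region (here the generator only starts running once iteration begins), otherwise the connection and queueing time silently drops out of the TTFB figure.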

Phase 2: Scoring Three Vendors on the Five-Axis Scorecard

Score ElevenLabs, OpenAI, and Cartesia on five axes

5 drops
  1. Build the five-axis scorecard you'll defend in every meeting

    7 min

    Naturalness, prosody, latency, consistency, controllability. Five axes, weighted by use case, scored 1-5 from your own ears and tests — not marketing pages.

  2. Score ElevenLabs — the naturalness-first premium option

    7 min

    ElevenLabs optimizes for 'sounds most human' and 'biggest voice library.' That's the pitch and the limitation in one sentence.

  3. Score OpenAI tts-1 — the developer-friendly default

    7 min

    OpenAI's TTS is the SQLite of TTS: nobody's first choice for premium voice, everyone's first choice when an OpenAI API key is already in the stack.

  4. Score Cartesia Sonic — the latency-obsessed voice-agent option

    7 min

    Cartesia is what happens when an infra team builds TTS for voice agents instead of narration. The TTFB numbers show it.

  5. Run the blind test that makes the scorecard real

    8 min

    Naming a winner from marketing pages is vendor capture in disguise. A blind A/B test on your own sentences is the only score that survives a design-review challenge.
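
The scorecard described in this phase (five axes, weighted by use case, scored 1-5) can be sketched as a weighted mean. The weights and scores below are illustrative placeholders for two unnamed candidates, not measurements of any vendor, and `weighted_score` is a hypothetical helper name.

```python
AXES = ("naturalness", "prosody", "latency", "consistency", "controllability")


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of 1-5 axis scores; weights are normalized by their sum."""
    for axis in AXES:
        if not 1 <= scores[axis] <= 5:
            raise ValueError(f"{axis} score must be between 1 and 5")
    total_weight = sum(weights[axis] for axis in AXES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[axis] * weights[axis] for axis in AXES) / total_weight


# Hypothetical weights for a real-time voice agent (latency-heavy).
voice_agent_weights = {"naturalness": 2, "prosody": 2, "latency": 4,
                       "consistency": 1, "controllability": 1}

# Placeholder scores from your own listening tests, not real vendor results.
candidate_a = {"naturalness": 5, "prosody": 4, "latency": 2,
               "consistency": 4, "controllability": 3}
candidate_b = {"naturalness": 3, "prosody": 3, "latency": 5,
               "consistency": 4, "controllability": 3}

if __name__ == "__main__":
    for name, scores in (("A", candidate_a), ("B", candidate_b)):
        print(name, round(weighted_score(scores, voice_agent_weights), 2))
```

Changing the weights for a narration use case (for example, raising naturalness and lowering latency) can flip the ranking, which is the point of weighting by use case before comparing totals.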

Phase 3: Reading the Failure Modes That Force a Switch

Spot when latency, drift, or prosody force a switch

4 drops
  1. Our voice agent felt laggy and the TTS wasn't the obvious culprit

    8 min

    When a voice agent feels slow, the instinct is to blame TTS — but TTFB is one of four contributors and rarely the largest. Audit the whole pipeline before swapping vendors.

  2. The voice clone drifted on minute eight — and it wasn't random

    8 min

    Voice consistency at scale is a function of model stability parameters and prompt structure, not vendor quality alone. Drift is fixable before it's a migration.

  3. We hit 100K daily voice minutes and the bill became a board question

    8 min

    Cost trajectory is a year-2 problem you create in week 1. The migration trigger is rarely quality; it is the bill at scale.

  4. Voice clones don't migrate — and that quietly changes everything

    8 min

    Migrating between TTS providers is mostly painless. Migrating cloned voices means re-recording or re-cloning every brand voice. The lock-in you actually have is to your voices, not your vendor.
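
The cost drop in this phase comes down to simple arithmetic on daily voice minutes. A rough sketch, assuming flat per-minute pricing and a 30-day month: the price and growth rate below are made-up placeholders, not any vendor's rates, and both function names are hypothetical.

```python
def monthly_tts_cost(daily_minutes: float, price_per_minute: float,
                     days: int = 30) -> float:
    """Monthly spend under flat per-minute pricing (an assumption)."""
    return daily_minutes * days * price_per_minute


def cost_trajectory(start_daily_minutes: float, monthly_growth: float,
                    price_per_minute: float, months: int) -> list:
    """Project monthly spend while daily usage compounds each month."""
    costs = []
    minutes = start_daily_minutes
    for _ in range(months):
        costs.append(monthly_tts_cost(minutes, price_per_minute))
        minutes *= 1 + monthly_growth
    return costs


if __name__ == "__main__":
    # 100K daily minutes at a hypothetical $0.01 per minute.
    print(monthly_tts_cost(100_000, 0.01))
    # Hypothetical ramp: 5K daily minutes growing 30% per month for a year.
    print([round(c) for c in cost_trajectory(5_000, 0.30, 0.01, 12)])
```

Running the projection in week 1 with each candidate's actual price sheet shows when the bill crosses whatever threshold would trigger a migration conversation.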

Phase 4: Defending the Voice-Agent TTS Choice

Pick a TTS for a voice agent and defend it

1 drop
  1. Write the decision memo for a voice-agent TTS pick

    25 min

    The deliverable that proves you understand TTS selection is a one-page memo that survives a design-review cross-examination.

Frequently asked questions

What is MOS and is it enough to judge a TTS vendor?
Why does my TTS sound great in the demo but off in production?
How do you measure TTS latency for a real-time voice agent?
What's the difference between streaming TTS and narration TTS?
How do you keep a TTS voice consistent across long-form content?

Each of these questions is covered in the “Understand Text-to-Speech Quality Dimensions” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.