
🎙️ Understand Speech-to-Text Accuracy and WER

Stop trusting WER numbers from someone else's benchmark — build a 50-clip eval set from your own production audio so the next time you swap transcription vendors, the decision rests on your data, not theirs.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: What WER Actually Measures

See why benchmark WER lies about your audio

4 drops
  1. WER counts the three ways a transcript can be wrong · 6 min

  2. 5% on LibriSpeech, 18% on your call recordings · 6 min

  3. The same model scores 5% or 12% depending on how you normalize · 6 min

  4. Word-level breaks down when words aren't the right unit · 6 min
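The three error types this phase covers reduce to one formula: WER = (substitutions + deletions + insertions) / reference word count. A minimal, hypothetical sketch of that computation via word-level edit distance (an illustration, not the course's code):

```python
# Minimal WER sketch: word-level edit distance over reference vs. hypothesis,
# divided by the number of reference words. Hypothetical example code.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution (or match)
                dp[i - 1][j] + 1,                               # deletion
                dp[i][j - 1] + 1,                               # insertion
            )
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion / 6 words ≈ 0.167
```

Note that the denominator is the reference length, so WER can exceed 100% when the model hallucinates extra words.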

Phase 2: Computing WER on Your Own Clips

Score three providers on five clips you control

5 drops
  1. The clips you grab in 20 minutes beat the benchmark every time · 6 min

  2. Same five clips through Whisper, Deepgram, AssemblyAI — race the API calls · 7 min

  3. Twenty lines of Python and you have three vendor WERs to compare · 6 min

  4. The per-clip diff is where vendor selection actually happens · 7 min

  5. When the model is wrong AND wrong-confident, you have a different problem · 6 min
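One way the scoring step in this phase could look, sketched with invented clip references and vendor transcripts standing in for real API responses. The normalization choice here (lowercase, strip punctuation) is one of many defensible ones, and it is exactly the kind of choice that moves a reported WER between, say, 5% and 12%:

```python
# Hypothetical per-clip vendor comparison; all transcripts below are made up.
import re

def normalize(text: str) -> str:
    # Lowercase and strip punctuation (keeping apostrophes). Pick one
    # normalization and apply it to every vendor, or the numbers lie.
    return re.sub(r"[^\w\s']", "", text.lower()).strip()

def wer(reference: str, hypothesis: str) -> float:
    # Word-level edit distance with a rolling 1-D table.
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j - 1] + (r != h),  # substitution (or match)
                         prev[j] + 1,             # deletion
                         cur[j - 1] + 1)          # insertion
        prev = cur
    return prev[-1] / len(ref)

clips = {  # clip_id -> your reference transcript
    "clip_01": "remind me to call doctor patel at noon",
    "clip_02": "the quarterly revenue was up eight percent",
}
vendors = {  # invented vendor outputs per clip
    "vendor_a": {"clip_01": "Remind me to call Dr. Patel at noon.",
                 "clip_02": "The quarterly revenue was up 8%."},
    "vendor_b": {"clip_01": "remind me to call doctor patel at new",
                 "clip_02": "the quarterly revenue was up eight percent"},
}
for name, hyps in vendors.items():
    scores = [wer(normalize(ref), normalize(hyps[cid])) for cid, ref in clips.items()]
    print(name, round(sum(scores) / len(scores), 3))
```

Keeping the per-clip scores around (not just the mean) is what makes the per-clip diff in drop 4 possible: "Dr." vs. "doctor" and "8%" vs. "eight percent" are normalization disagreements, while "noon" vs. "new" is a genuine recognition error.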

Phase 3: Where Domain Shift Breaks Models

Spot where accents, noise, and jargon break models

4 drops
  1. The accent that triples WER overnight · 7 min

  2. Medical and legal vocabulary breaks the general-purpose model · 7 min

  3. Two people talking at once is a different model problem · 7 min

  4. When fine-tuning beats prompting, and when it doesn't · 7 min

Phase 4: Designing a 50-Clip Eval Set

Build a 50-clip eval set for your real audio mix

1 drop
  1. Design a 50-clip eval set that represents your real production audio mix · 8 min
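The design step above amounts to proportional allocation: split 50 clips across audio conditions in the same ratio they occur in production. A hedged sketch, where the categories and percentages are invented placeholders for a real production mix:

```python
# Hypothetical allocation of 50 eval clips across invented audio conditions.
production_mix = {  # category -> share of production traffic (sums to 1.0)
    "clean_close_mic": 0.30,
    "phone_call_8khz": 0.35,
    "accented_speech": 0.20,
    "overlapping_speakers": 0.10,
    "heavy_jargon": 0.05,
}

total_clips = 50
exact = {k: share * total_clips for k, share in production_mix.items()}
# Largest-remainder rounding so the counts sum to exactly 50.
alloc = {k: int(v) for k, v in exact.items()}
leftover = total_clips - sum(alloc.values())
for k in sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)[:leftover]:
    alloc[k] += 1

for category, n in alloc.items():
    print(f"{category}: {n} clips")
```

With only 50 clips, a 5% category still gets 2 or 3 clips, which is enough to catch a catastrophic failure but not to estimate its WER precisely; that tension is what the drop works through.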

Frequently asked questions

What is word error rate (WER) and how is it calculated?
Word error rate is the number of substitutions, deletions, and insertions in a transcript, divided by the number of words in the reference transcript. The “Understand Speech-to-Text Accuracy and WER” learning path builds this up through daily micro-lessons, from the formula itself to hands-on application.
Why does Whisper's 5% WER not match what I see in production?
This is covered in the “Understand Speech-to-Text Accuracy and WER” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How many clips do I need in a speech-to-text eval set?
This is covered in the “Understand Speech-to-Text Accuracy and WER” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do I handle domain-specific jargon in transcription evals?
This is covered in the “Understand Speech-to-Text Accuracy and WER” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What's the difference between WER and CER, and when should I use each?
This is covered in the “Understand Speech-to-Text Accuracy and WER” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.