
🎧 Understand Audio Embeddings

Stop forcing audio through text. Drops show the audio-native path — wav2vec, CLAP, MERT — and when it beats transcribe-then-embed for music search, speaker ID, and sound classification. By the end you can plan a 'find similar drums' search over a sample library.

Foundations · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: What Audio Embeddings Actually Compress

Why audio is high-dimensional and what embeddings compress

4 drops
  1. Audio is a million numbers per minute and you can't search it raw

    6 min

One minute of audio is millions of raw samples — about 2.6 million at CD quality (44.1 kHz). Embeddings collapse it into a few hundred numbers that preserve perceptual structure.
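To make "millions of samples" concrete, a quick back-of-the-envelope in Python (assuming CD-quality 44.1 kHz mono and a 512-dim embedding, both representative figures rather than fixed standards):

```python
sample_rate = 44_100                 # CD-quality mono: samples per second
raw_samples = sample_rate * 60       # one minute of audio
print(raw_samples)                   # 2646000 raw numbers to compare

embedding_dim = 512                  # typical embedding size (e.g. CLAP)
print(raw_samples // embedding_dim)  # ~5167x fewer numbers to search
```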

  2. Transcribe-then-embed throws away everything that isn't words

    6 min

    Speech-to-text plus a text embedding captures meaning of words. It throws away pitch, tempo, timbre, speaker identity, and every non-speech sound.

  3. An audio embedding is a learned summary of a spectrogram

    6 min

    Inside, modern audio embedding models convert audio to a spectrogram (or learned equivalent) and run a transformer or CNN over it. The output is a fixed-size vector pooled across time.

  4. Cosine similarity over embeddings is your distance function

    6 min

    Once two clips are vectors, 'how similar are they' becomes a single line: cosine similarity. The geometry takes over.
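That single line looks like this in NumPy, using toy 3-dim vectors as stand-ins for real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

kick_a = np.array([0.9, 0.1, 0.0])   # toy "embeddings" of two kick drums
kick_b = np.array([0.8, 0.2, 0.1])
violin = np.array([0.0, 0.2, 0.9])   # and of something unrelated

print(cosine_similarity(kick_a, kick_b))  # close to 1.0: similar clips
print(cosine_similarity(kick_a, violin))  # near 0: unrelated clips
```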

Phase 2: Computing Embeddings and Finding Similar Clips

Compute CLAP embeddings for five clips and find similar pairs

5 drops
  1. Load CLAP and embed one clip in five lines

    7 min

    CLAP (Contrastive Language-Audio Pretraining) takes raw audio and returns a 512-dim vector. Loading it is the same Hugging Face pattern as any other model.
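A sketch of that Hugging Face pattern, assuming the public `laion/clap-htsat-unfused` checkpoint and a second of silence standing in for a real clip (CLAP expects 48 kHz audio):

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model_id = "laion/clap-htsat-unfused"   # one public CLAP checkpoint
model = ClapModel.from_pretrained(model_id)
processor = ClapProcessor.from_pretrained(model_id)

# One second of silence stands in for a real clip loaded at 48 kHz
audio = np.zeros(48_000, dtype=np.float32)
inputs = processor(audios=audio, sampling_rate=48_000, return_tensors="pt")
with torch.no_grad():
    embedding = model.get_audio_features(**inputs)
print(embedding.shape)   # torch.Size([1, 512])
```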

  2. Embed five clips and rank them by cosine similarity

    7 min

    With five embeddings in hand, a 5×5 cosine matrix tells you which pairs sound alike — and the answer matches what your ears say.
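The 5×5 matrix is one normalize-then-matmul, sketched here with random vectors standing in for real CLAP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 512))   # stand-ins for five CLAP embeddings

# Normalize rows, then one matrix product gives every pairwise cosine
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity = normed @ normed.T           # shape (5, 5), diagonal is 1.0

# Most similar pair, ignoring self-similarity on the diagonal
np.fill_diagonal(similarity, -np.inf)
i, j = np.unravel_index(np.argmax(similarity), similarity.shape)
print(f"clips {i} and {j} sound most alike")
```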

  3. Use the same embedding for zero-shot classification

    7 min

    Because CLAP is trained on (audio, text-caption) pairs, you can classify a clip by embedding both the clip and a list of text labels, then picking the closest text.
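A minimal zero-shot sketch, again assuming the `laion/clap-htsat-unfused` checkpoint and a silent stand-in clip; the label list is yours to choose:

```python
import numpy as np
import torch
from transformers import ClapModel, ClapProcessor

model_id = "laion/clap-htsat-unfused"
model = ClapModel.from_pretrained(model_id)
processor = ClapProcessor.from_pretrained(model_id)

labels = ["a drum loop", "a violin melody", "a dog barking"]
audio = np.zeros(48_000, dtype=np.float32)   # stand-in for a 48 kHz clip

inputs = processor(text=labels, audios=audio, sampling_rate=48_000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_audio scores the clip against each text label
probs = out.logits_per_audio.softmax(dim=-1)
print(labels[int(probs.argmax())])
```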

  4. Index a small library with FAISS for sublinear search

    8 min

    FAISS turns 'compare against every clip' into 'jump to the right neighborhood'. At 10,000+ clips, you need an index — not a for-loop.

  5. Debug an embedding pipeline by checking shapes and norms

    7 min

    When audio search returns garbage, the bug is almost always preprocessing — wrong sample rate, wrong channels, missing normalization, or wrong duration.
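Those checks are cheap to automate. A sketch of a pre-flight function (the expected sample rate of 48 kHz is an assumption; use whatever your model requires):

```python
import numpy as np

def sanity_check(audio, sample_rate, embeddings, expected_sr=48_000):
    """Cheap checks that catch the usual audio-pipeline bugs up front."""
    assert sample_rate == expected_sr, f"resample {sample_rate} -> {expected_sr} Hz"
    assert audio.ndim == 1, "mix down to mono before embedding"
    assert np.abs(audio).max() <= 1.0, "normalize samples into [-1, 1]"
    norms = np.linalg.norm(embeddings, axis=1)
    assert norms.min() > 0, "zero-norm embedding: silent or empty clip?"
    return norms

rng = np.random.default_rng(0)
audio = rng.uniform(-0.5, 0.5, size=48_000)   # one fake mono second at 48 kHz
embeddings = rng.normal(size=(5, 512))        # fake batch of five embeddings
print(sanity_check(audio, 48_000, embeddings).round(2))
```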

Phase 3: Choosing the Right Audio Embedding Model

Pick between wav2vec, MERT, and CLAP for real tasks

4 drops
  1. Your team wants speech search — wav2vec, not CLAP

    7 min

    wav2vec2 (and its successors HuBERT, WavLM) are trained on speech. They preserve phonemes, prosody, and speaker identity in a way general-audio models don't.

  2. For music tasks, MERT is the model with the right priors

    7 min

    MERT is trained on music with objectives that preserve pitch, harmony, and rhythm. It picks up structural music features that CLAP only partially encodes.

  3. CLAP wins when you need text queries over audio

    7 min

    CLAP's superpower isn't audio-only similarity — it's that you can search audio with a text caption ('a dog barking at night'). No other major audio embedding does this out of the box.
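Once you have CLAP audio embeddings for a library and a CLAP text embedding for the caption, text-to-audio search is just cosine ranking. A sketch with hypothetical precomputed vectors (random arrays here stand in for real CLAP outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical precomputed CLAP vectors: 100 library clips + 1 text caption
audio_embeds = rng.normal(size=(100, 512))
text_embed = rng.normal(size=(512,))

a = audio_embeds / np.linalg.norm(audio_embeds, axis=1, keepdims=True)
t = text_embed / np.linalg.norm(text_embed)

scores = a @ t                      # cosine of the caption vs. every clip
top5 = np.argsort(scores)[::-1][:5]
print(top5)                         # clip indices best matching the caption
```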

  4. Production audio search bugs are domain-shift bugs

    7 min

    When an audio search works in dev and fails in prod, the cause is almost always a distribution shift: cleaner training data than real-world audio, or a mismatch between expected clip length and what users actually upload.

Phase 4: Designing a Find-Similar-Drums Search

Plan a 'find similar drums' search over a sample library

1 drop
  1. Plan a 'find similar drums' search over a sample library

    12 min

A capstone planning exercise: choose the right model, specify preprocessing and indexing, and design the query flow for retrieving similar drum samples from a real sample library.

Frequently asked questions

What is an audio embedding and how is it different from a text embedding?
An audio embedding is a fixed-size vector computed from the audio signal itself, so it preserves perceptual structure like pitch, timbre, and rhythm. A text embedding summarizes the meaning of words; it can only capture what a transcript captures.
Why not just transcribe audio and embed the text?
Transcription keeps only the words. Pitch, tempo, timbre, speaker identity, and every non-speech sound are thrown away, and those are exactly the features you need for music search, speaker ID, and sound classification.
When should I use CLAP vs wav2vec vs MERT?
Use wav2vec2 (or its successors HuBERT and WavLM) for speech tasks, MERT for music tasks where pitch, harmony, and rhythm matter, and CLAP when you need to search audio with text queries.
How do I find similar audio clips with embeddings?
Embed each clip into a vector, then rank candidates by cosine similarity. For libraries beyond a few thousand clips, put the vectors in a FAISS index so search is sublinear instead of a for-loop over every clip.
Can audio embeddings capture pitch, tempo, and timbre?
Yes. Audio-native models compute embeddings from the spectrogram (or a learned equivalent), so pitch, tempo, and timbre survive; music-specialized models like MERT are trained specifically to preserve them.