🧬 Sentence vs Token Embeddings
Stop grabbing BERT's [CLS] token and calling it a sentence embedding. By the end you'll know exactly when token, pooled, and contrastively trained vectors each win, and you'll design a 100K-doc semantic search you can defend.
Phase 1: What Each Vector Actually Represents
What token vs sentence embeddings actually represent
A token vector is a context-aware fragment, not a meaning (6 min)
What a sentence embedding actually has to do (6 min)
Why [CLS] looks like a sentence embedding but isn't (7 min)
Mean-pooling is better than [CLS] and still not enough (7 min)
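Here's a preview of the two strategies Phase 1 pulls apart, as a minimal sketch using Hugging Face transformers; the bert-base-uncased checkpoint and the example sentences are illustrative choices, not the path's prescription.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is an illustrative choice; any BERT-style encoder works.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = ["The cat sat on the mat.", "A feline rested on the rug."]
batch = tok(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    token_vecs = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Strategy 1: grab the [CLS] position and call it a sentence vector.
cls_emb = token_vecs[:, 0]

# Strategy 2: mean-pool over real tokens only, masking out padding.
mask = batch["attention_mask"].unsqueeze(-1).float()
mean_emb = (token_vecs * mask).sum(dim=1) / mask.sum(dim=1)
```

The mask-weighted mean is the detail that matters: a naive `token_vecs.mean(dim=1)` would average padding vectors into every short sentence's embedding.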
Phase 2: Three Embeddings on One Task
Compare [CLS], mean-pool, and sentence-transformers head-to-head
Pick a single task and lock the rest down (6 min)
Run [CLS], mean-pool, and SBERT head-to-head (9 min)
What contrastive training actually changes (7 min)
Pooling tricks: mean, max, CLS, attention (7 min)
When token embeddings are still the right tool (6 min)
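As a taste of the head-to-head, here's a minimal sketch of the sentence-transformers side, assuming the sentence-transformers library; all-MiniLM-L6-v2 and the toy pairs are stand-ins for a real model choice and evaluation set.

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is an illustrative checkpoint, not a recommendation.
model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("How do I reset my password?", "Steps to recover account access"),
    ("How do I reset my password?", "Best hiking trails near Denver"),
]
for a, b in pairs:
    emb = model.encode([a, b], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    print(f"{score:.3f}  {a!r} vs {b!r}")
```

A contrastively trained model should score the paraphrase pair well above the unrelated one; the Phase 2 exercise is measuring whether raw [CLS] and mean-pooling can say the same.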
Phase 3: Pipelines, Not Single Choices
Place bi-encoders, cross-encoders, and rerankers in a pipeline
Bi-encoders are the only embedding that scales (7 min)
Cross-encoders are the only scorer that captures nuance (7 min)
Two-stage retrieve-and-rerank is the canonical shape (7 min)
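The canonical two-stage shape fits in a short sketch, again assuming sentence-transformers; both checkpoint names and the three-document corpus are placeholders for the real thing.

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

# Both checkpoint names are illustrative placeholders.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

docs = [
    "Rotate API keys on a fixed schedule and after any suspected leak.",
    "Our office coffee machine supports six brew strengths.",
    "Store secrets in a vault, never in the repository.",
]  # stands in for a real corpus
doc_embs = bi_encoder.encode(docs, convert_to_tensor=True)

query = "how should I handle credential rotation?"
q_emb = bi_encoder.encode(query, convert_to_tensor=True)

# Stage 1: cheap vector search over every document, keep top-k candidates.
hits = util.semantic_search(q_emb, doc_embs, top_k=2)[0]
candidates = [docs[h["corpus_id"]] for h in hits]

# Stage 2: expensive pairwise scoring, but only over the candidates.
scores = reranker.predict([(query, d) for d in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {doc}")
```

Stage 1 touches every document but only with a dot product; stage 2 runs a full transformer forward pass per (query, candidate) pair, which is exactly why it only ever sees the top-k.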
Phase 4: Design the 100K Search
Design a 100K-doc semantic search and defend it
Choose the bi-encoder for 100K documents (7 min)
Design and defend a 100K-doc semantic search (20 min)
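One defensible starting point for the capstone's scale, sketched with FAISS; the model choice, stand-in corpus, and flat index are assumptions to argue with, not the path's prescribed answer.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

docs = [f"placeholder document {i}" for i in range(100_000)]  # stand-in corpus
embs = model.encode(docs, batch_size=256, normalize_embeddings=True)

# With L2-normalized vectors, inner product equals cosine similarity,
# and at 100K vectors an exact (flat) index is still fast.
index = faiss.IndexFlatIP(embs.shape[1])
index.add(embs.astype(np.float32))

query = model.encode(["how do I rotate API keys?"], normalize_embeddings=True)
scores, ids = index.search(query.astype(np.float32), 10)
print([docs[i] for i in ids[0]])
```

The flat index is a deliberate choice at this scale: approximate structures like HNSW or IVF buy speed a 100K-vector corpus may not need yet, at the cost of recall you would then have to measure and defend.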
Frequently asked questions
- What's the difference between a token embedding and a sentence embedding?
- A token embedding represents one subword in the context of its neighbors: a context-aware fragment, not a standalone meaning. A sentence embedding has to compress the whole sequence into a single vector whose distances track semantic similarity. Phase 1 builds this distinction from the ground up.
- Why is BERT's [CLS] token a bad sentence embedding out of the box?
- [CLS] is trained for BERT's pretraining objectives (masked-token and next-sentence prediction), not for similarity, so untuned [CLS] vectors live in a space where cosine distance doesn't reliably track meaning. The “Why [CLS] looks like a sentence embedding but isn't” lesson unpacks this.
- When should I use mean-pooled BERT vs a sentence-transformers model?
- Mean-pooled BERT is a reasonable no-extra-model baseline, but contrastively trained sentence-transformers models are optimized for exactly the similarity judgments pooling is asked to fake, and they usually win. Phase 2 runs all three head-to-head on a single locked-down task.
- Do I need a cross-encoder reranker on top of bi-encoder retrieval?
- Only when precision at the top of the ranking justifies the extra latency. The canonical shape is two-stage: bi-encoder retrieval for recall at scale, then a cross-encoder reranking the top candidates for nuance. Phase 3 covers when the second stage earns its cost.
- How do I pick an embedding model for a 100K-document semantic search?
- Phase 4 walks through the decision: choose a bi-encoder you can justify for your task and budget, decide whether a reranking stage is worth it, and be ready to defend every choice in the 20-minute capstone design exercise.
Related paths
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️ Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time (pods, deployments, services, ingress, config), then deploy a real app with rolling updates and health checks.
📈 Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.