🔡Understand Tokenization: How Models See Text
Stop counting characters and start seeing text the way the model does — as subword pieces that vary wildly in cost. By the end you'll eyeball a paragraph's token count and know why emoji, code, and rare words inflate your bill.
Phase 1: How Models Actually See Text
Why character and word splits both fail the model
Models don't read text — they read tokens
6 min · An LLM never sees the letters you typed. It sees a sequence of integers, each one pointing to a chunk of text the tokenizer carved out beforehand.
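A minimal sketch of that idea, using a made-up four-entry vocabulary (the IDs are illustrative, not any real model's):

```python
# Toy vocabulary: the model never receives characters, only integer IDs
# that index into a token table like this one (IDs are invented here).
vocab = {"The": 0, " cat": 1, " sat": 2, ".": 3}
inv = {i: t for t, i in vocab.items()}

def encode(pieces):
    return [vocab[p] for p in pieces]

def decode(ids):
    return "".join(inv[i] for i in ids)

ids = encode(["The", " cat", " sat", "."])
print(ids)          # [0, 1, 2, 3]  <- this is all the model sees
print(decode(ids))  # The cat sat.
```

Real vocabularies hold tens of thousands of entries, but the contract is the same: text in, integers out, and the model operates only on the integers.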
Character-level looked clean — and was a disaster
6 min · Treating each character as a token gives you a tiny vocabulary but explodes sequence length, blowing past context windows and burning compute on patterns the model has to relearn from scratch.
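A quick way to feel the length explosion, comparing character tokens to coarser chunks on one sentence:

```python
# Character-level tokenization: tiny vocabulary, very long sequences.
text = "Tokenization turns text into integers."

char_tokens = list(text)    # one token per character
word_chunks = text.split()  # rough stand-in for coarser chunks

print(len(char_tokens))  # 38 tokens for one short sentence
print(len(word_chunks))  # 5 chunks for the same content
```

Nearly an order of magnitude more positions for the same sentence, and every one of them consumes context-window budget and attention compute.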
Word-level looked obvious — until you met a typo
6 min · Word-level tokenization seems intuitive, but it can't handle new words, typos, code, or any language that doesn't space-separate — and that's most of the real world.
Subword: common chunks stay whole, rare ones split
7 min · Subword tokenization keeps frequent words as single tokens and breaks rare ones into reusable pieces — so 'apple' stays one token but 'antidisestablishmentarianism' becomes seven.
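A toy greedy longest-match splitter makes the behavior concrete. The subword set below is invented for illustration; real vocabularies are learned from data, and real tokenizers may split these words differently:

```python
# Hypothetical subword inventory: frequent words stay whole, rare words
# fall apart into reusable pieces (WordPiece-style greedy matching sketch).
subwords = {"apple", "anti", "dis", "establish", "ment", "arian", "ism"}

def split(word):
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])         # unknown character falls through
            i += 1
    return pieces

print(split("apple"))
# ['apple']
print(split("antidisestablishmentarianism"))
# ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```

Same mechanism, wildly different token costs: the common word is one piece, the rare one is billed piece by piece.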
Phase 2: Run BPE on a Real Sentence
Run BPE by hand on a real sentence
BPE is just 'merge the most frequent pair, repeat'
6 min · Byte Pair Encoding is one rule applied thousands of times: find the most frequent adjacent pair of tokens in your corpus, merge it into a new token, add it to the vocabulary. That's the entire algorithm.
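The whole loop fits in a few lines. This is a training sketch on the classic toy corpus (low, lower, newest, widest); production implementations add end-of-word markers, byte-level fallback, and operate on symbol lists instead of string replacement, but the core rule is exactly this:

```python
from collections import Counter

def get_pairs(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(pair, words):
    """Rewrite every word so the chosen pair becomes one symbol.
    (String replacement is a simplification that works for this toy corpus.)"""
    old, new = " ".join(pair), "".join(pair)
    return {w.replace(old, new): f for w, f in words.items()}

# Each word starts as space-separated characters, with a corpus frequency.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
merges = []
for _ in range(3):
    pairs = get_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair wins
    merges.append(best)
    words = merge(best, words)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The merge list is the learned artifact: 'e'+'s' wins first because 'es' appears in both 'newest' (6×) and 'widest' (3×), and each later merge builds on earlier ones.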
Tokenize 'lower' and 'newest' with BPE manually
7 min · Walking BPE through two real words, 'lower' and 'newest', shows how corpus-wide pair frequencies drive each split: 'newest' inherits the frequent 'est' merge, while the rarer ending of 'lower' stays fragmented.
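Once merges are learned, tokenizing a new word means replaying them in order. The merge list below is an assumption for illustration, of the kind a small toy corpus would produce:

```python
# Hypothetical learned merges, applied in the order they were learned.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

def apply_bpe(word, merges):
    syms = list(word)  # start from individual characters
    for a, b in merges:
        i = 0
        while i < len(syms) - 1:
            if syms[i] == a and syms[i + 1] == b:
                syms[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return syms

print(apply_bpe("lower", merges))   # ['low', 'e', 'r']
print(apply_bpe("newest", merges))  # ['n', 'e', 'w', 'est']
```

'newest' picks up the frequent 'est' merge; 'er' was too rare in this corpus to earn a merge of its own, so 'lower' ends in two single-character tokens.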
Vocab size is a budget — and the tradeoffs are real
7 min · Bigger vocab means shorter sequences but bigger embedding tables and rarer-token coverage; smaller vocab means longer sequences but more frequent token reuse. Every tokenizer chooses a budget.
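The budget is easy to put numbers on. A back-of-envelope sketch, with illustrative sizes rather than any particular model's:

```python
# Embedding table cost alone: vocab_size * d_model parameters.
# (An untied output projection roughly doubles this.)
def embedding_params(vocab_size, d_model):
    return vocab_size * d_model

small = embedding_params(32_000, 4096)   # 131,072,000 parameters
large = embedding_params(128_000, 4096)  # 524,288,000 parameters

print(large - small)  # 393216000 extra parameters for the bigger vocab
```

Nearly 400M extra parameters buy shorter sequences; whether that trade pays off depends on how often the extra tokens actually fire.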
BOS, EOS, and chat templates: tokens you didn't write
7 min · Every API call sneaks in special tokens (beginning-of-sequence, end-of-sequence, role markers) that count against your bill but never appear in the text you sent.
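A hypothetical chat template (not any specific provider's actual format) makes the overhead visible:

```python
# Invented template markers for illustration; real providers each use
# their own special tokens, but the wrapping pattern is the same.
def wrap(user_text):
    return f"<BOS><|user|>{user_text}<|end|><|assistant|>"

msg = "Hi"
wrapped = wrap(msg)
print(wrapped)  # <BOS><|user|>Hi<|end|><|assistant|>
# Every marker is tokenized and billed, though you only typed "Hi".
```

For short messages the template can cost more tokens than the message itself, which is why per-message overhead matters in chatty, multi-turn workloads.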
Bytes, not characters — why emoji cost 3-5 tokens
7 min · Modern tokenizers operate on UTF-8 bytes, not Unicode characters, so a single emoji that takes 4 bytes can become 4 separate tokens — even though it's 'one character' to you.
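You can check the byte counts directly; the expansion below is plain UTF-8, before any tokenizer-specific merges apply:

```python
# UTF-8 byte counts for things that each look like "one character".
# A byte-level tokenizer starts from these bytes, so multi-byte
# characters can expand into multiple tokens.
for ch in ["a", "é", "中", "🚀"]:
    print(ch, len(ch.encode("utf-8")))
# a 1
# é 2
# 中 3
# 🚀 4
```

Whether those 4 bytes end up as 4 tokens or fewer depends on which byte sequences the tokenizer learned to merge; common emoji often get merged, rare ones usually don't.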
Phase 3: Why Tokens Cost Different Things
Why emoji, code, and CJK cost more tokens
Same paragraph, three models, three token counts
7 min · GPT-4, Claude, and Llama all tokenize differently — so quoting 'this prompt is 1,000 tokens' is meaningless without saying which model you're targeting.
Code costs more than prose — and indentation costs the most
8 min · Tokenizer vocabularies are dominated by prose, so the identifiers, indentation, and special characters in code fragment into many tokens. A 100-line Python file routinely costs 2-3× the tokens of a 100-line English document.
Context windows lie — and the lie is in the tokenizer
7 min · A '128k context window' isn't 128k characters of your prompt, it's 128k tokens, and how much text that buys depends on the content: English prose averages roughly 3-4 characters per token, while dense code and CJK text often pack far fewer.
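A rough capacity calculator shows the spread. The chars-per-token ratios below are illustrative assumptions, so measure them against your target model's tokenizer before relying on them:

```python
# Ballpark chars-per-token ratios (assumptions for illustration only;
# real ratios vary by model and by the specific text).
RATIOS = {"english_prose": 4.0, "source_code": 2.5, "cjk_text": 0.7}

def window_chars(tokens, kind):
    """Approximate characters a token budget buys for a content type."""
    return int(tokens * RATIOS[kind])

for kind in RATIOS:
    print(f"{kind}: ~{window_chars(128_000, kind):,} characters")
```

The point survives any particular ratio: the same 128k-token window holds several times more English than code or CJK, so 'how much fits' is a property of your content, not just the model.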
Three ways to spend fewer tokens — and one not to
7 min · Three legitimate ways to cut token cost — prompt compression, tokenizer-aware formatting, and provider-side caching — beat the tempting fourth way (truncation), which silently degrades quality.
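One concrete instance of tokenizer-aware formatting: collapsing whitespace runs shrinks the byte stream the tokenizer sees without deleting any content, unlike truncation:

```python
import re

def compact(text):
    """Collapse whitespace runs while keeping every word intact."""
    text = re.sub(r"[ \t]+", " ", text)     # collapse horizontal runs
    text = re.sub(r"\n{3,}", "\n\n", text)  # cap consecutive blank lines
    return text.strip()

bloated = "Hello,    world!\n\n\n\n   How   are   you?   "
print(repr(compact(bloated)))  # 'Hello, world!\n\n How are you?'
```

How many tokens this actually saves depends on the tokenizer (some merge whitespace runs into single tokens, some don't), which is exactly why the lesson pairs formatting tricks with verification.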
Phase 4: Predict and Verify Token Counts
Predict token counts and verify in a playground
Predict, verify, and write yourself a token cheat sheet
8 min · Estimate the token counts of prose, code, and emoji-heavy text by hand, then check your guesses in a tokenizer playground and write down the ratios you'll reuse.
Frequently asked questions
- What is tokenization in an LLM?
- Tokenization is the step that converts your text into a sequence of integer IDs, each pointing to a subword chunk the tokenizer carved out beforehand; the model only ever sees those integers. Phase 1 of this path builds that mental model from scratch.
- Why does 'antidisestablishmentarianism' cost more tokens than 'apple'?
- Subword tokenizers keep frequent words whole, so 'apple' stays a single token, while rare words like 'antidisestablishmentarianism' get split into several reusable pieces, each billed separately. The subword lesson in Phase 1 walks through exactly how the split happens.
- Why do emoji and Chinese characters use so many tokens?
- Modern tokenizers operate on UTF-8 bytes, not characters: an emoji takes up to 4 bytes and many CJK characters take 3, so each can expand into multiple tokens. The bytes-not-characters lesson in Phase 2 covers the details.
- Why do GPT, Claude, and Llama report different token counts for the same text?
- Each model family trains its own tokenizer with its own vocabulary and merge rules, so the same text splits into different pieces. Phase 3 compares the same paragraph across GPT-4, Claude, and Llama.
- How can I estimate the token count of a paragraph without running a tokenizer?
- Use rough ratios (around 3-4 characters per token for English prose, fewer for code and CJK text), then verify your guesses in a tokenizer playground. Phase 4 walks through building your own cheat sheet of ratios.
Related paths
🐍Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
📈Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.