
🔡 Understand Tokenization: How Models See Text

Stop counting characters and start seeing text the way the model does — as subword pieces that vary wildly in cost. By the end you'll eyeball a paragraph's token count and know why emoji, code, and rare words inflate your bill.

Foundations · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: How Models Actually See Text

Why character and word splits both fail the model

4 drops
  1. Models don't read text — they read tokens

    6 min

    An LLM never sees the letters you typed. It sees a sequence of integers, each one pointing to a chunk of text the tokenizer carved out beforehand.

  2. Character-level looked clean — and was a disaster

    6 min

    Treating each character as a token gives you a tiny vocabulary but explodes sequence length, blowing past context windows and burning compute on patterns the model has to relearn from scratch.

  3. Word-level looked obvious — until you met a typo

    6 min

    Word-level tokenization seems intuitive, but it can't handle new words, typos, code, or any language that doesn't space-separate — and that's most of the real world.

  4. Subword: common chunks stay whole, rare ones split

    7 min

Subword tokenization keeps frequent words as single tokens and breaks rare ones into reusable pieces — so 'apple' stays one token while 'antidisestablishmentarianism' splits into seven.
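A taste of what this phase builds toward: a toy greedy longest-match splitter over a hand-picked vocabulary. This is a sketch only — the vocabulary here is made up for illustration, real tokenizers learn theirs from data, and the exact split of a rare word varies by model.

```python
# Toy subword splitter: take the longest vocabulary entry that matches,
# then repeat from where it ended. Vocabulary is hand-picked to illustrate
# the idea; real tokenizers learn theirs from a corpus.
VOCAB = {"apple", "anti", "dis", "establish", "ment", "arian", "ism"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # unknown char falls back to itself
            i += 1
    return tokens

print(tokenize("apple"))
# ['apple'] -- frequent word, stays whole
print(tokenize("antidisestablishmentarianism"))
# ['anti', 'dis', 'establish', 'ment', 'arian', 'ism'] -- rare word splits
```

The frequent word costs one token; the rare word costs six reusable pieces, each of which the model has seen in many other words.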

Phase 2: Run BPE on a Real Sentence

Run BPE by hand on a real sentence

5 drops
  1. BPE is just 'merge the most frequent pair, repeat'

    6 min

    Byte Pair Encoding is one rule applied thousands of times: find the most frequent adjacent pair of tokens in your corpus, merge it into a new token, add it to the vocabulary. That's the entire algorithm.

  2. Tokenize 'lower' and 'newest' with BPE manually

    7 min

Walking BPE through two real words — 'lower' and 'newest' — shows how both end up sharing the merge 'we' while diverging on everything else.

  3. Vocab size is a budget — and the tradeoffs are real

    7 min

A bigger vocab means shorter sequences but a bigger embedding table and rare tokens that each get less training signal; a smaller vocab means longer sequences but heavier reuse of every token. Every tokenizer picks a point on that budget.

  4. BOS, EOS, and chat templates: tokens you didn't write

    7 min

    Every API call sneaks in special tokens — start-of-sequence, end-of-sequence, role markers — that count against your bill but never appear in the text you sent.

  5. Bytes, not characters — why emoji cost 3-5 tokens

    7 min

    Modern tokenizers operate on UTF-8 bytes, not Unicode characters, so a single emoji that takes 4 bytes can become 4 separate tokens — even though it's 'one character' to you.
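You can see the bytes-versus-characters gap from drop 5 directly in Python — no tokenizer needed, since UTF-8 byte length is what a byte-level tokenizer starts from:

```python
# One on-screen "character" can be several UTF-8 bytes, and byte-level
# tokenizers start from those bytes, not from what you see.
for s in ["a", "é", "中", "🙂"]:
    print(f"{s!r}: {len(s)} code point(s), {len(s.encode('utf-8'))} UTF-8 bytes")
# 'a' is 1 byte, 'é' is 2, '中' is 3, '🙂' is 4 -- so a single emoji
# can cost as many tokens as a short English word.
```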

Phase 3: Why Tokens Cost Different Things

Why emoji, code, and CJK cost more tokens

4 drops
  1. Same paragraph, three models, three token counts

    7 min

    GPT-4, Claude, and Llama all tokenize differently — so quoting 'this prompt is 1,000 tokens' is meaningless without saying which model you're targeting.

  2. Code costs more than prose — and indentation costs the most

    8 min

Tokenizers were trained on prose-heavy corpora, so code identifiers, indentation, and special characters fragment into many tokens. A 100-line Python file routinely costs 2-3× a 100-line English document.

  3. Context windows lie — and the lie is in the tokenizer

    7 min

    A '128k context window' isn't 128k of your prompt — it's 128k tokens, which depending on your content might be 80k characters of code, 350k characters of English, or 50k of Chinese.

  4. Three ways to spend fewer tokens — and one not to

    7 min

    Three legitimate ways to cut token cost — prompt compression, tokenizer-aware formatting, and provider-side caching — beat the tempting fourth way (truncation), which silently degrades quality.
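The rough figures from drop 3 (80k characters of code, 350k of English, or 50k of Chinese per 128k tokens) can be turned into a back-of-envelope estimator. The ratios below come straight from those figures and are coarse assumptions, not any real model's tokenizer — always verify against the target model:

```python
# Chars-per-token ratios derived from the path's rough per-128k-token
# figures. Coarse estimates only; verify with the real tokenizer.
CHARS_PER_TOKEN = {
    "code":    80_000 / 128_000,   # ~0.6 chars/token
    "english": 350_000 / 128_000,  # ~2.7 chars/token
    "chinese": 50_000 / 128_000,   # ~0.4 chars/token
}

def estimate_tokens(text: str, kind: str = "english") -> int:
    """Back-of-envelope token estimate by content type."""
    return round(len(text) / CHARS_PER_TOKEN[kind])

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
```

The point is not the exact numbers but the habit: know your content type's ratio, estimate before you send, and check the bill against your guess.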

Phase 4: Predict and Verify Token Counts

Predict token counts and verify in a playground

1 drop
  1. Predict, verify, and write yourself a token cheat sheet

    8 min

Use the rules from earlier phases to predict a paragraph's token count, check your prediction in a tokenizer playground, and write down the ratios that matter for your own content.

Frequently asked questions

What is tokenization in an LLM?
An LLM never sees the letters you type — it sees a sequence of integers, each pointing to a chunk of text its tokenizer carved out beforehand. Phase 1 of this path builds the full story in daily 5–8 minute drops.
Why does 'antidisestablishmentarianism' cost more tokens than 'apple'?
Subword tokenizers keep frequent words like 'apple' as a single token but split rare words into several reusable pieces, so a long rare word costs many tokens. The Phase 1 drop on subword tokenization walks through exactly this example.
Why do emoji and Chinese characters use so many tokens?
Modern tokenizers operate on UTF-8 bytes, not characters: a single emoji can take 4 bytes and a Chinese character 3, so each can cost several tokens. Phases 2 and 3 cover the mechanics and the cost implications.
Why do GPT, Claude, and Llama report different token counts for the same text?
Each model family trains its own tokenizer with its own vocabulary and merge rules, so the same text splits into different tokens under each. Phase 3 compares the same paragraph across models.
How can I estimate the token count of a paragraph without running a tokenizer?
Learn the rough chars-per-token ratios for your content type (prose, code, CJK), predict a count, then verify it in a tokenizer playground. Phase 4 walks you through building your own cheat sheet.