
🔡 Understand Tokenization: How Models See Text

Stop counting characters and start seeing text the way the model does — as subword pieces that vary wildly in cost. By the end you'll eyeball a paragraph's token count and know why emoji, code, and rare words inflate your bill.

Foundations · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: How Models Actually See Text

Why character and word splits both fail the model

4 drops
  1. Models don't read text — they read tokens

    6 min

    An LLM never sees the letters you typed. It sees a sequence of integers, each one pointing to a chunk of text the tokenizer carved out beforehand.

  2. Character-level looked clean — and was a disaster

    6 min

    Treating each character as a token gives you a tiny vocabulary but explodes sequence length, blowing past context windows and burning compute on patterns the model has to relearn from scratch.

  3. Word-level looked obvious — until you met a typo

    6 min

    Word-level tokenization seems intuitive, but it can't handle new words, typos, code, or any language that doesn't space-separate — and that's most of the real world.

  4. Subword: common chunks stay whole, rare ones split

    7 min

Subword tokenization keeps frequent words as single tokens and breaks rare ones into reusable pieces — so 'apple' stays one token while 'antidisestablishmentarianism' splits into seven.
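A taste of what this phase builds toward: a toy greedy longest-match splitter over a hand-picked vocabulary. This is a sketch only — the vocabulary here is made up for illustration, real tokenizers learn theirs from data, and the exact split of a rare word varies by model.

```python
# Toy subword splitter: take the longest vocabulary entry that matches,
# then repeat from where it ended. Vocabulary is hand-picked to illustrate
# the idea; real tokenizers learn theirs from a corpus.
VOCAB = {"apple", "anti", "dis", "establish", "ment", "arian", "ism"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try longest match first
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])          # unknown char falls back to itself
            i += 1
    return tokens

print(tokenize("apple"))
# ['apple'] -- frequent word, stays whole
print(tokenize("antidisestablishmentarianism"))
# ['anti', 'dis', 'establish', 'ment', 'arian', 'ism'] -- rare word splits
```

The frequent word costs one token; the rare word costs six reusable pieces, each of which the model has seen in many other words.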

Phase 2: Run BPE on a Real Sentence

Run BPE by hand on a real sentence

5 drops
  1. BPE is just 'merge the most frequent pair, repeat'

    6 min

    Byte Pair Encoding is one rule applied thousands of times: find the most frequent adjacent pair of tokens in your corpus, merge it into a new token, add it to the vocabulary. That's the entire algorithm.

  2. Tokenize 'lower' and 'newest' with BPE manually

    7 min

Walking BPE through two real words — 'lower' and 'newest' — shows how both end up sharing the merge 'we' while diverging on everything else.

  3. Vocab size is a budget — and the tradeoffs are real

    7 min

A bigger vocab means shorter sequences but a bigger embedding table and rare tokens that each get less training signal; a smaller vocab means longer sequences but heavier reuse of every token. Every tokenizer picks a point on that budget.

  4. BOS, EOS, and chat templates: tokens you didn't write

    7 min

    Every API call sneaks in special tokens — start-of-sequence, end-of-sequence, role markers — that count against your bill but never appear in the text you sent.

  5. Bytes, not characters — why emoji cost 3-5 tokens

    7 min

    Modern tokenizers operate on UTF-8 bytes, not Unicode characters, so a single emoji that takes 4 bytes can become 4 separate tokens — even though it's 'one character' to you.
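You can see the bytes-versus-characters gap from drop 5 directly in Python — no tokenizer needed, since UTF-8 byte length is what a byte-level tokenizer starts from:

```python
# One on-screen "character" can be several UTF-8 bytes, and byte-level
# tokenizers start from those bytes, not from what you see.
for s in ["a", "é", "中", "🙂"]:
    print(f"{s!r}: {len(s)} code point(s), {len(s.encode('utf-8'))} UTF-8 bytes")
# 'a' is 1 byte, 'é' is 2, '中' is 3, '🙂' is 4 -- so a single emoji
# can cost as many tokens as a short English word.
```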

Phase 3: Why Tokens Cost Different Things

Why emoji, code, and CJK cost more tokens

4 drops
  1. Same paragraph, three models, three token counts

    7 min

    GPT-4, Claude, and Llama all tokenize differently — so quoting 'this prompt is 1,000 tokens' is meaningless without saying which model you're targeting.

  2. Code costs more than prose — and indentation costs the most

    8 min

Tokenizers were trained on prose-heavy corpora, so code identifiers, indentation, and special characters fragment into many tokens. A 100-line Python file routinely costs 2-3× a 100-line English document.

  3. Context windows lie — and the lie is in the tokenizer

    7 min

    A '128k context window' isn't 128k of your prompt — it's 128k tokens, which depending on your content might be 80k characters of code, 350k characters of English, or 50k of Chinese.

  4. Three ways to spend fewer tokens — and one not to

    7 min

    Three legitimate ways to cut token cost — prompt compression, tokenizer-aware formatting, and provider-side caching — beat the tempting fourth way (truncation), which silently degrades quality.
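The rough figures from drop 3 (80k characters of code, 350k of English, or 50k of Chinese per 128k tokens) can be turned into a back-of-envelope estimator. The ratios below come straight from those figures and are coarse assumptions, not any real model's tokenizer — always verify against the target model:

```python
# Chars-per-token ratios derived from the path's rough per-128k-token
# figures. Coarse estimates only; verify with the real tokenizer.
CHARS_PER_TOKEN = {
    "code":    80_000 / 128_000,   # ~0.6 chars/token
    "english": 350_000 / 128_000,  # ~2.7 chars/token
    "chinese": 50_000 / 128_000,   # ~0.4 chars/token
}

def estimate_tokens(text: str, kind: str = "english") -> int:
    """Back-of-envelope token estimate by content type."""
    return round(len(text) / CHARS_PER_TOKEN[kind])

print(estimate_tokens("The quick brown fox jumps over the lazy dog."))
```

The point is not the exact numbers but the habit: know your content type's ratio, estimate before you send, and check the bill against your guess.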

Phase 4: Predict and Verify Token Counts

Predict token counts and verify in a playground

1 drop
  1. Predict, verify, and write yourself a token cheat sheet

    8 min

Use the rules from earlier phases to predict a paragraph's token count, check your prediction in a tokenizer playground, and write down the ratios that matter for your own content.

Frequently asked questions

What is tokenization in an LLM?
An LLM never sees the letters you type — it sees a sequence of integers, each pointing to a chunk of text its tokenizer carved out beforehand. Phase 1 of this path builds the full story in daily 5–8 minute drops.
Why does 'antidisestablishmentarianism' cost more tokens than 'apple'?
Subword tokenizers keep frequent words like 'apple' as a single token but split rare words into several reusable pieces, so a long rare word costs many tokens. The Phase 1 drop on subword tokenization walks through exactly this example.
Why do emoji and Chinese characters use so many tokens?
Modern tokenizers operate on UTF-8 bytes, not characters: a single emoji can take 4 bytes and a Chinese character 3, so each can cost several tokens. Phases 2 and 3 cover the mechanics and the cost implications.
Why do GPT, Claude, and Llama report different token counts for the same text?
Each model family trains its own tokenizer with its own vocabulary and merge rules, so the same text splits into different tokens under each. Phase 3 compares the same paragraph across models.
How can I estimate the token count of a paragraph without running a tokenizer?
Learn the rough chars-per-token ratios for your content type (prose, code, CJK), predict a count, then verify it in a tokenizer playground. Phase 4 walks you through building your own cheat sheet.