🔀 Learn How Transformers Process Sequences
Trace one token through every block of a transformer — embed, position, attend, FFN, residual — until you can narrate, in plain English, how 'the cat sat' becomes French.
Phase 1: Why Attention Replaced Recurrence
See why attention beat recurrence on long context
RNNs read one word at a time; transformers read all of them at once
6 min · Transformers traded sequential reading for parallel attention — and that single trade unlocked everything.
A transformer is five blocks, repeated
6 min · The whole architecture is embed, position, attend, FFN, residual — stacked. Once you see the five blocks, you see every model.
Without positional encoding, a transformer reads a sentence as a bag of words
6 min · Attention is order-blind. You have to inject position explicitly, or 'cat sat dog' and 'dog sat cat' look identical.
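A small numpy demonstration of that order-blindness, using toy dimensions and random weights rather than any real model's values: permuting the input tokens just permutes the outputs identically, so nothing in plain attention distinguishes the two orderings.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(d)) @ v

tokens = rng.normal(size=(3, d))       # stand-ins for 'cat', 'sat', 'dog'
perm = [2, 1, 0]                        # 'dog sat cat'

out_original = attend(tokens)
out_permuted = attend(tokens[perm])

# The permuted output is just the original output, reordered: attention carries
# no notion of position until we inject one explicitly.
print(np.allclose(out_permuted, out_original[perm]))   # True
```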
Attention is just three matrix multiplies and a softmax
7 min · What looks like AI magic is one of the simplest operations in deep learning — three multiplies and a softmax over similarity scores.
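As a sketch with toy sizes and random weights (none of these numbers come from a trained model), the whole operation fits in a dozen lines of numpy:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # the three matrix multiplies
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # similarity of every query with every key
    weights = softmax(scores)                   # each row becomes a distribution over tokens
    return weights @ V                          # weighted mix of the value vectors

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(3, d))                                # three token vectors
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))   # toy random weights
print(attention(x, Wq, Wk, Wv).shape)                      # (3, 16)
```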
Phase 2: Tracing a Token Through the Stack
Walk a single token through every block
The model never sees 'cat' — it sees integer 4937
6 min · Before any neural network math happens, your sentence becomes a list of integers. The vocabulary is the model's whole world.
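A toy sketch of that first step. The vocabulary and ids here are made up (real tokenizers split text into subwords and have tens of thousands of entries), but the output is the same kind of thing: a list of integers, and nothing else.

```python
# Hypothetical toy vocabulary; the integer ids are arbitrary.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenize(sentence):
    # Unknown words fall back to the <unk> id.
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

print(tokenize("The cat sat"))   # [1, 2, 3]: the model only ever sees these integers
```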
Embeddings turn an integer into a vector full of meaning
6 min · An embedding is a row lookup in a giant table — and that lookup is where 'cat' starts to mean something close to 'feline.'
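A minimal numpy illustration, with a made-up table of random values standing in for a trained embedding matrix; training is what eventually nudges those rows so that related words end up near each other.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 8                   # toy sizes
embedding_table = rng.normal(size=(vocab_size, d_model))

ids = np.array([1, 2, 3])                    # 'the cat sat' from the toy vocab above
x = embedding_table[ids]                     # the entire operation: index three rows

print(x.shape)                               # (3, 8): one d_model vector per token
```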
Position encoding adds 'where you are' to 'what you are'
6 min · After embedding, each token vector gets stamped with its position — by adding a position-shaped vector right on top.
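A sketch of the sinusoidal encoding from the original transformer paper, added straight onto toy embeddings. Learned positional embeddings do the same job at this level, just with a lookup table instead of a formula.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Even dimensions get sine, odd dimensions get cosine, at wavelengths
    # that grow geometrically across the model dimension.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))                  # embedded 'the cat sat' (toy values)
x = x + sinusoidal_positions(3, 8)           # stamp 'where you are' onto 'what you are'
print(x.shape)                               # still (3, 8)
```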
Multi-head attention lets every token look at every other token, eight different ways
7 min · One head learns syntax, another learns coreference, another learns word order — and the model gets to use all of them at once.
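A compact numpy sketch of the split-attend-concatenate pattern with eight heads; the head count, dimensions, and random weights are illustrative only.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(W):
        # Project once, then slice the model dimension into n_heads smaller pieces.
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)              # (heads, seq, d_head)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # each head attends separately
    heads = softmax(scores) @ V                            # (heads, seq, d_head)
    # Concatenate the heads back together and mix them with one output matrix.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(3, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads=8).shape)   # (3, 16)
```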
FFN and residuals are where attention's output gets digested
7 min · Attention mixes tokens. FFN transforms each token alone. Residuals make sure information from earlier layers never gets lost.
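A sketch of the position-wise feed-forward block and its residual connection, using toy shapes and random weights; real blocks also wrap these steps in layer normalization, which is omitted here for brevity.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Position-wise: the same two-layer MLP is applied to each token's vector
    # independently, with no mixing across tokens (attention already did that).
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(3, d))                       # attention output for three tokens
W1, b1 = rng.normal(size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(size=(4 * d, d)), np.zeros(d)

# The residual: the FFN's output is added to its input, so whatever earlier
# layers computed rides through unchanged unless the FFN deliberately adjusts it.
out = x + ffn(x, W1, b1, W2, b2)
print(out.shape)                                  # (3, 16)
```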
Phase 3: How Encoder, Decoder, and Both Differ
Compare BERT, GPT, and T5 architectures
BERT reads the whole sentence at once — even the future
6 min · Encoder-only models like BERT see every token bidirectionally. They're built for understanding, not generating.
GPT can only read the past — that's why it can write the future
6 min · Decoder-only models add a causal mask: token N can attend to tokens 1 through N, but never N+1 or later. That mask is the entire reason GPT generates.
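A tiny numpy sketch of the mask itself: upper-triangular positions are set to negative infinity before the softmax, so their attention weights come out as exactly zero. The scores here are all-zero placeholders rather than real query-key products.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))            # placeholder raw attention scores

# Causal mask: position N may look at positions 1..N, never ahead.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                           # future positions get zero weight after softmax

print(np.round(softmax(scores), 2))
# Row 0 attends only to itself; row 3 attends to all four positions up to and including itself:
# [[1.   0.   0.   0.  ]
#  [0.5  0.5  0.   0.  ]
#  [0.33 0.33 0.33 0.  ]
#  [0.25 0.25 0.25 0.25]]
```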
T5 keeps both halves — encoder reads, decoder writes
7 min · Encoder-decoder models split the work: one half understands the input, the other half generates the output, with cross-attention bridging them.
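A sketch of that bridge under toy assumptions (random weights, made-up sequence lengths): queries come from the side that is writing, keys and values from the side that read the input.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_x, encoder_out, Wq, Wk, Wv):
    Q = decoder_x @ Wq                            # queries from the decoder (writer)
    K, V = encoder_out @ Wk, encoder_out @ Wv     # keys and values from the encoder (reader)
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 16
encoder_out = rng.normal(size=(3, d))    # encoder states for the 3 input tokens
decoder_x = rng.normal(size=(4, d))      # decoder states for the 4 output tokens so far
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Each of the 4 output-side positions mixes information from the 3 input tokens.
print(cross_attention(decoder_x, encoder_out, Wq, Wk, Wv).shape)   # (4, 16)
```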
Same five blocks, three different wirings, three different jobs
7 min · BERT, GPT, and T5 share the same building blocks. Only the attention mask and the stack arrangement change — and that's enough to give them completely different capabilities.
Phase 4: Translating 'The Cat Sat' Block by Block
Narrate the 'the cat sat' translation end to end
Narrate the full path: 'the cat sat' to 'le chat était assis'
8 min · Once you can narrate every block of a transformer in plain language for one tiny example, you genuinely understand the architecture.
Frequently asked questions
- How do transformers process sequences without recurrence?
- Instead of carrying a hidden state word by word, a transformer lets every token attend to every other token in parallel: self-attention plus positional encoding replaces recurrence entirely. Phase 1 of this path walks through that trade.
- What is the role of positional encoding in a transformer?
- Attention by itself is order-blind, so each token's embedding gets a position-dependent vector added on top; without it, 'cat sat dog' and 'dog sat cat' would look identical to the model. The positional-encoding lessons in Phases 1 and 2 cover this.
- Why does multi-head attention work better than a single attention head?
- Each head runs its own attention over a slice of the model dimension, so one head can track syntax while another tracks coreference or word order, and the model combines all of them at once. Covered in Phase 2.
- How is BERT different from GPT architecturally?
- Both are stacks of the same five blocks. BERT is encoder-only and attends bidirectionally, built for understanding; GPT is decoder-only with a causal mask, so each token sees only the past, which is what lets it generate. Phase 3 compares them side by side.
- What does a feed-forward layer do inside a transformer block?
- Attention mixes information across tokens; the feed-forward layer then transforms each token's vector on its own (expand, nonlinearity, project back), with a residual connection preserving what earlier layers computed. Covered in Phase 2.
Related paths
🐍 Python Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking — then ship a working caching or logging decorator from scratch in under 30 lines.
🦀 Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic — one failing snippet at a time — until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
☸️ Kubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
📈 Big O Intuition
Stop treating Big O as math you memorized for an interview — build the intuition to spot O(n²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(n²) to O(n) in under five minutes.