
🔀 Learn How Transformers Process Sequences

Trace one token through every block of a transformer — embed, position, attend, FFN, residual — until you can narrate, in plain English, how 'the cat sat' becomes French.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Attention Replaced Recurrence

See why attention beat recurrence on long context

4 drops
  1. RNNs read one word at a time, transformers read all of them at once

    6 min

    Transformers traded sequential reading for parallel attention — and that single trade unlocked everything.

  2. A transformer is five blocks, repeated

    6 min

    The whole architecture is embed, position, attend, FFN, residual — stacked. Once you see the five blocks, you see every model.

  3. Without positional encoding, a transformer reads a sentence as a bag of words

    6 min

    Attention is order-blind. You have to inject position explicitly, or 'cat sat dog' and 'dog sat cat' look identical.

  4. Attention is just three matrix multiplies and a softmax

    7 min

    What looks like AI magic is one of the simplest operations in deep learning — three multiplies and a softmax over similarity scores.
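The "three matrix multiplies and a softmax" claim can be checked in a few lines of NumPy. This is a hypothetical single-head sketch with random weights, not code from the path itself; the shapes and weight names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence X."""
    Q = X @ Wq  # multiply 1: queries
    K = X @ Wk  # multiply 2: keys
    V = X @ Wv  # multiply 3: values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity between tokens
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))  # 3 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (3, 8): one updated vector per token
```

Everything after the three multiplies is just a softmax-weighted average, which is why the output keeps the same shape as the input.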

Phase 2: Tracing a Token Through the Stack

Walk a single token through every block

5 drops
  1. The model never sees 'cat' — it sees integer 4937

    6 min

    Before any neural network math happens, your sentence becomes a list of integers. The vocabulary is the model's whole world.

  2. Embeddings turn an integer into a vector full of meaning

    6 min

    An embedding is a row lookup in a giant table — and that lookup is where 'cat' starts to mean something close to 'feline.'

  3. Position encoding adds 'where you are' to 'what you are'

    6 min

    After embedding, each token vector gets stamped with its position — by adding a position-shaped vector right on top.

  4. Multi-head attention lets every token look at every other token, eight different ways

    7 min

    One head learns syntax, another learns coreference, another learns word order — and the model gets to use all of them at once.

  5. FFN and residuals are where attention's output gets digested

    7 min

    Attention mixes tokens. FFN transforms each token alone. Residuals make sure information from earlier layers never gets lost.
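The five drops above can be strung together in one NumPy sketch: tokenize, look up embeddings, add sinusoidal positions, run multi-head attention, then a position-wise FFN with residuals. The three-word vocabulary and all weights here are toy assumptions for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; real tokenizers map subwords to integer ids.
vocab = {"the": 0, "cat": 1, "sat": 2}
token_ids = [vocab[w] for w in "the cat sat".split()]  # [0, 1, 2]

d_model = 8
rng = np.random.default_rng(0)

# Embedding: a lookup table with one learned row per vocabulary entry.
embed_table = rng.normal(size=(len(vocab), d_model))
x = embed_table[token_ids]  # (3, 8): integer -> vector

# Sinusoidal position encoding, added (not concatenated) on top.
pos = np.arange(len(token_ids))[:, None]
i = np.arange(d_model)[None, :]
angles = pos / 10000 ** (2 * (i // 2) / d_model)
pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
x = x + pe  # "what you are" + "where you are"

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Multi-head attention: split d_model into 2 heads of size 4,
# attend per head, then concatenate the heads back together.
n_heads, d_head = 2, d_model // 2
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = (x @ W for W in (Wq, Wk, Wv))
heads = []
for h in range(n_heads):
    sl = slice(h * d_head, (h + 1) * d_head)
    w = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_head))
    heads.append(w @ V[:, sl])
x = x + np.concatenate(heads, axis=1)  # residual around attention

# Position-wise FFN with a residual: transforms each token alone.
W1, W2 = rng.normal(size=(d_model, 32)), rng.normal(size=(32, d_model))
x = x + np.maximum(x @ W1, 0) @ W2  # ReLU inside, residual outside
print(x.shape)  # still (3, 8) at every stage
```

Note the pattern: attention is the only step that mixes information across tokens; embedding, position, and FFN all act on each token independently.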

Phase 3: How Encoder, Decoder, and Both Differ

Compare BERT, GPT, and T5 architectures

4 drops
  1. BERT reads the whole sentence at once — even the future

    6 min

    Encoder-only models like BERT see every token bidirectionally. They're built for understanding, not generating.

  2. GPT can only read the past — that's why it can write the future

    6 min

    Decoder-only models add a causal mask: token N can attend to tokens 1 through N, but never N+1 or later. That mask is the entire reason GPT generates.

  3. T5 keeps both halves — encoder reads, decoder writes

    7 min

    Encoder-decoder models split the work: one half understands the input, the other half generates the output, with cross-attention bridging them.

  4. Same five blocks, three different wirings, three different jobs

    7 min

    BERT, GPT, and T5 share the same building blocks. Only the attention mask and the stack arrangement change — and that's enough to give them completely different capabilities.
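The "only the mask changes" point is easy to demonstrate. A minimal sketch, using random scores rather than a real model: the same attention weights become BERT-style (bidirectional) or GPT-style (causal) depending solely on whether a triangular mask is applied before the softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # raw attention scores among 4 tokens

# Encoder-style (BERT): no mask, every token attends to every token.
bert_weights = softmax(scores)

# Decoder-style (GPT): a causal mask puts -inf above the diagonal,
# so token i gets exactly zero weight on tokens i+1 ... n after softmax.
causal = np.triu(np.ones((n, n), dtype=bool), k=1)
gpt_weights = softmax(np.where(causal, -np.inf, scores))

print(np.round(gpt_weights, 2))
# Row 0 attends only to token 0; row 3 attends to all four tokens.
```

T5 combines both: full (unmasked) attention in the encoder, causal attention in the decoder, plus cross-attention from decoder to encoder.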

Phase 4: Translating 'The Cat Sat' Block by Block

Narrate the cat sat translation end to end

1 drop
  1. Narrate the full path: 'the cat sat' to 'le chat était assis'

    8 min

    Once you can narrate every block of a transformer in plain language for one tiny example, you genuinely understand the architecture.

Frequently asked questions

How do transformers process sequences without recurrence?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
What is the role of positional encoding in a transformer?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
Why does multi-head attention work better than a single attention head?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
How is BERT different from GPT architecturally?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
What does a feed-forward layer do inside a transformer block?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.