
🔀 Learn How Transformers Process Sequences

Trace one token through every block of a transformer — embed, position, attend, FFN, residual — until you can narrate, in plain English, how 'the cat sat' becomes French.

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why Attention Replaced Recurrence

See why attention beat recurrence on long context

4 drops
  1. RNNs read one word at a time, transformers read all of them at once

    6 min

    Transformers traded sequential reading for parallel attention — and that single trade unlocked everything.

  2. A transformer is five blocks, repeated

    6 min

    The whole architecture is embed, position, attend, FFN, residual — stacked. Once you see the five blocks, you see every model.

  3. Without positional encoding, a transformer reads a sentence as a bag of words

    6 min

    Attention is order-blind. You have to inject position explicitly, or 'cat sat dog' and 'dog sat cat' look identical.

  4. Attention is just three matrix multiplies and a softmax

    7 min

    What looks like AI magic is one of the simplest operations in deep learning — three multiplies and a softmax over similarity scores.
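The "three matrix multiplies and a softmax" claim can be checked in a few lines of NumPy. This is a hypothetical single-head sketch with random weights, not code from the path itself; the shapes and weight names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence X."""
    Q = X @ Wq  # multiply 1: queries
    K = X @ Wk  # multiply 2: keys
    V = X @ Wv  # multiply 3: values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity between tokens
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))  # 3 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (3, 8): one updated vector per token
```

Everything after the three multiplies is just a softmax-weighted average, which is why the output keeps the same shape as the input.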

Phase 2: Tracing a Token Through the Stack

Walk a single token through every block

5 drops
  1. The model never sees 'cat' — it sees integer 4937

    6 min

    Before any neural network math happens, your sentence becomes a list of integers. The vocabulary is the model's whole world.

  2. Embeddings turn an integer into a vector full of meaning

    6 min

    An embedding is a row lookup in a giant table — and that lookup is where 'cat' starts to mean something close to 'feline.'

  3. Position encoding adds 'where you are' to 'what you are'

    6 min

    After embedding, each token vector gets stamped with its position — by adding a position-shaped vector right on top.

  4. Multi-head attention lets every token look at every other token, eight different ways

    7 min

    One head learns syntax, another learns coreference, another learns word order — and the model gets to use all of them at once.

  5. FFN and residuals are where attention's output gets digested

    7 min

    Attention mixes tokens. FFN transforms each token alone. Residuals make sure information from earlier layers never gets lost.
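The five drops above can be strung together in one NumPy sketch: tokenize, look up embeddings, add sinusoidal positions, run multi-head attention, then a position-wise FFN with residuals. The three-word vocabulary and all weights here are toy assumptions for illustration.

```python
import numpy as np

# Hypothetical toy vocabulary; real tokenizers map subwords to integer ids.
vocab = {"the": 0, "cat": 1, "sat": 2}
token_ids = [vocab[w] for w in "the cat sat".split()]  # [0, 1, 2]

d_model = 8
rng = np.random.default_rng(0)

# Embedding: a lookup table with one learned row per vocabulary entry.
embed_table = rng.normal(size=(len(vocab), d_model))
x = embed_table[token_ids]  # (3, 8): integer -> vector

# Sinusoidal position encoding, added (not concatenated) on top.
pos = np.arange(len(token_ids))[:, None]
i = np.arange(d_model)[None, :]
angles = pos / 10000 ** (2 * (i // 2) / d_model)
pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
x = x + pe  # "what you are" + "where you are"

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Multi-head attention: split d_model into 2 heads of size 4,
# attend per head, then concatenate the heads back together.
n_heads, d_head = 2, d_model // 2
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = (x @ W for W in (Wq, Wk, Wv))
heads = []
for h in range(n_heads):
    sl = slice(h * d_head, (h + 1) * d_head)
    w = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(d_head))
    heads.append(w @ V[:, sl])
x = x + np.concatenate(heads, axis=1)  # residual around attention

# Position-wise FFN with a residual: transforms each token alone.
W1, W2 = rng.normal(size=(d_model, 32)), rng.normal(size=(32, d_model))
x = x + np.maximum(x @ W1, 0) @ W2  # ReLU inside, residual outside
print(x.shape)  # still (3, 8) at every stage
```

Note the pattern: attention is the only step that mixes information across tokens; embedding, position, and FFN all act on each token independently.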

Phase 3: How Encoder, Decoder, and Both Differ

Compare BERT, GPT, and T5 architectures

4 drops
  1. BERT reads the whole sentence at once — even the future

    6 min

    Encoder-only models like BERT see every token bidirectionally. They're built for understanding, not generating.

  2. GPT can only read the past — that's why it can write the future

    6 min

    Decoder-only models add a causal mask: token N can attend to tokens 1 through N, but never N+1 or later. That mask is the entire reason GPT generates.

  3. T5 keeps both halves — encoder reads, decoder writes

    7 min

    Encoder-decoder models split the work: one half understands the input, the other half generates the output, with cross-attention bridging them.

  4. Same five blocks, three different wirings, three different jobs

    7 min

    BERT, GPT, and T5 share the same building blocks. Only the attention mask and the stack arrangement change — and that's enough to give them completely different capabilities.
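The "only the mask changes" point is easy to demonstrate. A minimal sketch, using random scores rather than a real model: the same attention weights become BERT-style (bidirectional) or GPT-style (causal) depending solely on whether a triangular mask is applied before the softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # raw attention scores among 4 tokens

# Encoder-style (BERT): no mask, every token attends to every token.
bert_weights = softmax(scores)

# Decoder-style (GPT): a causal mask puts -inf above the diagonal,
# so token i gets exactly zero weight on tokens i+1 ... n after softmax.
causal = np.triu(np.ones((n, n), dtype=bool), k=1)
gpt_weights = softmax(np.where(causal, -np.inf, scores))

print(np.round(gpt_weights, 2))
# Row 0 attends only to token 0; row 3 attends to all four tokens.
```

T5 combines both: full (unmasked) attention in the encoder, causal attention in the decoder, plus cross-attention from decoder to encoder.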

Phase 4: Translating 'The Cat Sat' Block by Block

Narrate the cat sat translation end to end

1 drop
  1. Narrate the full path: 'the cat sat' to 'le chat était assis'

    8 min

    Once you can narrate every block of a transformer in plain language for one tiny example, you genuinely understand the architecture.

Frequently asked questions

How do transformers process sequences without recurrence?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
What is the role of positional encoding in a transformer?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
Why does multi-head attention work better than a single attention head?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
How is BERT different from GPT architecturally?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.
What does a feed-forward layer do inside a transformer block?
This is covered in the “Learn How Transformers Process Sequences” learning path. Start with daily 5–8 minute micro-lessons that build from fundamentals to hands-on application.