What is LLM jailbreaking and why does it work?

This is covered in the "Understand Jailbreaking and AI Safety" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Are persona jailbreaks like DAN actually a real safety issue?

This is covered in the "Understand Jailbreaking and AI Safety" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

What's the difference between a jailbreak and prompt injection?

This is covered in the "Understand Jailbreaking and AI Safety" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

How do gradient-based attacks like GCG differ from clever prompts?

This is covered in the "Understand Jailbreaking and AI Safety" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

How do AI labs use red-teaming to make models safer?

This is covered in the "Understand Jailbreaking and AI Safety" learning path on Droplet. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

Back to library

🛡️Understand Jailbreaking and AI Safety

See LLM jailbreaking as four distinct attack families instead of one scary headline, then turn that taxonomy into a one-page risk note for an AI feature you actually ship.

Applied14 drops~2-week path · 5–8 min/daytechnology

Phase 1Why Safety Training Becomes a Target

See why safety training is a target, not a wall

4 drops

Jailbreaking is bypassing post-training, not breaking a wall
6 min
Safety isn't baked into the model's neurons. It's a behavior layer learned during post-training, and a jailbreak is any input that steers the model out of that learned behavior.
Four jailbreak families — name them and the headlines stop blurring
7 min
Most published jailbreaks reduce to one of four families: persona, encoding, multi-turn, or gradient-based. Each exploits a different weakness in post-training, and each has a different difficulty and impact.
Jailbreak and prompt injection are different threats
7 min
A jailbreak is the user steering the model off its policy. A prompt injection is a third party hijacking the model through data the user trusts. Same surface, different attacker, completely different defenses.
Whack-a-mole isn't a bug — it's the geometry of the problem
6 min
Each patched jailbreak teaches the model one new edge case. The space of possible prompts is effectively infinite, so coverage approaches but never reaches the boundary. The cycle is structural, not a sign of failure.

Phase 2Three Jailbreak Families on a Toy Prompt

Walk three jailbreak families on a toy guarded prompt

5 drops

Build the toy guarded prompt you'll attack all week
5 min
You can't see how attacks work without a defender to attack. A five-line system prompt with one rule is the smallest defender that still produces realistic refusals.
The persona attack works because the model wants to be helpful
7 min
The model doesn't 'forget' the rule under a persona prompt — it reframes the rule as 'something my normal self would do, but this character wouldn't.' The pull toward helpfulness does the rest.
Encoding attacks slip past safety because the filter reads tokens, not meaning
7 min
Refusal training fires on the surface form of the request. Encode the request — base64, leetspeak, ROT13, low-resource language — and the surface form changes while the meaning survives. The model decodes faithfully because that's a useful skill.
Multi-turn attacks win because no single message is the attack
7 min
Every individual turn is benign. The trajectory across turns is the attack. Refusal classifiers that look at one message at a time can't see what the conversation is becoming.
Gradient attacks find inputs nobody could have written by hand
8 min
With access to model weights or a similar surrogate, you can directly optimize an input string against a 'must-refuse' loss. The output looks like garbage, but it bypasses safety reliably — and the strings often transfer to other models you never touched.

Phase 3How Red Teams Feed Back Into Training

Trace how red teams feed back into safer training

4 drops

Red teams aren't enemies — they're the training data engine
6 min
Internal and external red teams produce the failure cases that become the next batch of safety post-training. The attack pipeline is the defense pipeline. Without sustained red-teaming, models would degrade silently as the world's attack surface grows.
Adversarial training puts the attack inside the loss function
7 min
Instead of waiting for attackers to find inputs, adversarial training generates them — gradient attacks, persona attacks, encoding attacks — and trains the model to refuse them as part of the same RL loop. The model learns to handle the family, not just the example.
No single layer holds — defense in depth is how systems actually stay safe
7 min
Production AI safety isn't post-training alone. It's post-training plus input filtering plus output classification plus rate limiting plus tool sandboxing plus human review for high-stakes calls. Each layer is leaky on its own; the combination is what holds.
Pick your threat model first — most jailbreak families won't be your threat
7 min
Whether a jailbreak family matters to your AI feature depends on three things: what content the model can produce, what authority it has, and who the realistic attacker is. Most features face a small subset of the families. Naming yours saves you from defending against the wrong attacks.

Phase 4Write a One-Page Jailbreak Risk Note

Write a one-page risk note for your own feature

1 drop

Write the one-page jailbreak risk note for your real feature
8 min
Write the one-page jailbreak risk note for your real feature

Frequently asked questions

What is LLM jailbreaking and why does it work?: This is covered in the “Understand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Are persona jailbreaks like DAN actually a real safety issue?: This is covered in the “Understand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What's the difference between a jailbreak and prompt injection?: This is covered in the “Understand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do gradient-based attacks like GCG differ from clever prompts?: This is covered in the “Understand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do AI labs use red-teaming to make models safer?: This is covered in the “Understand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.

🛡️Understand Jailbreaking and AI Safety

Phase 1Why Safety Training Becomes a Target

Jailbreaking is bypassing post-training, not breaking a wall

Four jailbreak families — name them and the headlines stop blurring

Jailbreak and prompt injection are different threats

Whack-a-mole isn't a bug — it's the geometry of the problem

Phase 2Three Jailbreak Families on a Toy Prompt

Build the toy guarded prompt you'll attack all week

The persona attack works because the model wants to be helpful

Encoding attacks slip past safety because the filter reads tokens, not meaning

Multi-turn attacks win because no single message is the attack

Gradient attacks find inputs nobody could have written by hand

Phase 3How Red Teams Feed Back Into Training

Red teams aren't enemies — they're the training data engine

Adversarial training puts the attack inside the loss function

No single layer holds — defense in depth is how systems actually stay safe

Pick your threat model first — most jailbreak families won't be your threat

Phase 4Write a One-Page Jailbreak Risk Note

Write the one-page jailbreak risk note for your real feature

Frequently asked questions

🐍Python Decorators Introduction

🦀Rust Lifetimes Explained

☸️Kubernetes Core Concepts

📈Big O Intuition

Phase 1Why Safety Training Becomes a Target

Jailbreaking is bypassing post-training, not breaking a wall

Four jailbreak families — name them and the headlines stop blurring

Jailbreak and prompt injection are different threats

Whack-a-mole isn't a bug — it's the geometry of the problem

Phase 2Three Jailbreak Families on a Toy Prompt

Build the toy guarded prompt you'll attack all week

The persona attack works because the model wants to be helpful

Encoding attacks slip past safety because the filter reads tokens, not meaning

Multi-turn attacks win because no single message is the attack

Gradient attacks find inputs nobody could have written by hand

Phase 3How Red Teams Feed Back Into Training

Red teams aren't enemies — they're the training data engine

Adversarial training puts the attack inside the loss function

No single layer holds — defense in depth is how systems actually stay safe

Pick your threat model first — most jailbreak families won't be your threat

Phase 4Write a One-Page Jailbreak Risk Note

Write the one-page jailbreak risk note for your real feature

Frequently asked questions

Related paths

🐍Python Decorators Introduction

🦀Rust Lifetimes Explained

☸️Kubernetes Core Concepts

📈Big O Intuition