π‘οΈUnderstand Jailbreaking and AI Safety
See LLM jailbreaking as four distinct attack families instead of one scary headline, then turn that taxonomy into a one-page risk note for an AI feature you actually ship.
Phase 1Why Safety Training Becomes a Target
See why safety training is a target, not a wall
Jailbreaking is bypassing post-training, not breaking a wall
6 minSafety isn't baked into the model's neurons. It's a behavior layer learned during post-training, and a jailbreak is any input that steers the model out of that learned behavior.
Four jailbreak families β name them and the headlines stop blurring
7 minMost published jailbreaks reduce to one of four families: persona, encoding, multi-turn, or gradient-based. Each exploits a different weakness in post-training, and each has a different difficulty and impact.
Jailbreak and prompt injection are different threats
7 minA jailbreak is the user steering the model off its policy. A prompt injection is a third party hijacking the model through data the user trusts. Same surface, different attacker, completely different defenses.
Whack-a-mole isn't a bug β it's the geometry of the problem
6 minEach patched jailbreak teaches the model one new edge case. The space of possible prompts is effectively infinite, so coverage approaches but never reaches the boundary. The cycle is structural, not a sign of failure.
Phase 2Three Jailbreak Families on a Toy Prompt
Walk three jailbreak families on a toy guarded prompt
Build the toy guarded prompt you'll attack all week
5 minYou can't see how attacks work without a defender to attack. A five-line system prompt with one rule is the smallest defender that still produces realistic refusals.
The persona attack works because the model wants to be helpful
7 minThe model doesn't 'forget' the rule under a persona prompt β it reframes the rule as 'something my normal self would do, but this character wouldn't.' The pull toward helpfulness does the rest.
Encoding attacks slip past safety because the filter reads tokens, not meaning
7 minRefusal training fires on the surface form of the request. Encode the request β base64, leetspeak, ROT13, low-resource language β and the surface form changes while the meaning survives. The model decodes faithfully because that's a useful skill.
Multi-turn attacks win because no single message is the attack
7 minEvery individual turn is benign. The trajectory across turns is the attack. Refusal classifiers that look at one message at a time can't see what the conversation is becoming.
Gradient attacks find inputs nobody could have written by hand
8 minWith access to model weights or a similar surrogate, you can directly optimize an input string against a 'must-refuse' loss. The output looks like garbage, but it bypasses safety reliably β and the strings often transfer to other models you never touched.
Phase 3How Red Teams Feed Back Into Training
Trace how red teams feed back into safer training
Red teams aren't enemies β they're the training data engine
6 minInternal and external red teams produce the failure cases that become the next batch of safety post-training. The attack pipeline is the defense pipeline. Without sustained red-teaming, models would degrade silently as the world's attack surface grows.
Adversarial training puts the attack inside the loss function
7 minInstead of waiting for attackers to find inputs, adversarial training generates them β gradient attacks, persona attacks, encoding attacks β and trains the model to refuse them as part of the same RL loop. The model learns to handle the family, not just the example.
No single layer holds β defense in depth is how systems actually stay safe
7 minProduction AI safety isn't post-training alone. It's post-training plus input filtering plus output classification plus rate limiting plus tool sandboxing plus human review for high-stakes calls. Each layer is leaky on its own; the combination is what holds.
Pick your threat model first β most jailbreak families won't be your threat
7 minWhether a jailbreak family matters to your AI feature depends on three things: what content the model can produce, what authority it has, and who the realistic attacker is. Most features face a small subset of the families. Naming yours saves you from defending against the wrong attacks.
Phase 4Write a One-Page Jailbreak Risk Note
Write a one-page risk note for your own feature
Write the one-page jailbreak risk note for your real feature
8 minWrite the one-page jailbreak risk note for your real feature
Frequently asked questions
- What is LLM jailbreaking and why does it work?
- This is covered in the βUnderstand Jailbreaking and AI Safetyβ learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- Are persona jailbreaks like DAN actually a real safety issue?
- This is covered in the βUnderstand Jailbreaking and AI Safetyβ learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- What's the difference between a jailbreak and prompt injection?
- This is covered in the βUnderstand Jailbreaking and AI Safetyβ learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- How do gradient-based attacks like GCG differ from clever prompts?
- This is covered in the βUnderstand Jailbreaking and AI Safetyβ learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
- How do AI labs use red-teaming to make models safer?
- This is covered in the βUnderstand Jailbreaking and AI Safetyβ learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Related paths
πPython Decorators Introduction
Build one mental model for Python decorators that covers closures, argument passing, functools.wraps, and stacking β then ship a working caching or logging decorator from scratch in under 30 lines.
π¦Rust Lifetimes Explained
Stop reading `'a` as line noise and start reading it as scope arithmetic β one failing snippet at a time β until you can thread lifetimes through a small parser or iterator adapter without fighting the borrow checker.
βΈοΈKubernetes Core Concepts
Stop drowning in 30+ resource types. Build the mental model one primitive at a time -- pods, deployments, services, ingress, config -- then deploy a real app with rolling updates and health checks.
πBig O Intuition
Stop treating Big O as math you memorized for an interview β build the intuition to spot O(nΒ²) disasters, pick the right data structure without thinking, and rewrite a slow function from O(nΒ²) to O(n) in under five minutes.