Back to library

πŸ›‘οΈUnderstand Jailbreaking and AI Safety

See LLM jailbreaking as four distinct attack families instead of one scary headline, then turn that taxonomy into a one-page risk note for an AI feature you actually ship.

Applied14 drops~2-week path Β· 5–8 min/daytechnology

Phase 1Why Safety Training Becomes a Target

See why safety training is a target, not a wall

4 drops
  1. Jailbreaking is bypassing post-training, not breaking a wall

    6 min

    Safety isn't baked into the model's neurons. It's a behavior layer learned during post-training, and a jailbreak is any input that steers the model out of that learned behavior.

  2. Four jailbreak families β€” name them and the headlines stop blurring

    7 min

    Most published jailbreaks reduce to one of four families: persona, encoding, multi-turn, or gradient-based. Each exploits a different weakness in post-training, and each has a different difficulty and impact.

  3. Jailbreak and prompt injection are different threats

    7 min

    A jailbreak is the user steering the model off its policy. A prompt injection is a third party hijacking the model through data the user trusts. Same surface, different attacker, completely different defenses.

  4. Whack-a-mole isn't a bug β€” it's the geometry of the problem

    6 min

    Each patched jailbreak teaches the model one new edge case. The space of possible prompts is effectively infinite, so coverage approaches but never reaches the boundary. The cycle is structural, not a sign of failure.

Phase 2Three Jailbreak Families on a Toy Prompt

Walk three jailbreak families on a toy guarded prompt

5 drops
  1. Build the toy guarded prompt you'll attack all week

    5 min

    You can't see how attacks work without a defender to attack. A five-line system prompt with one rule is the smallest defender that still produces realistic refusals.

  2. The persona attack works because the model wants to be helpful

    7 min

    The model doesn't 'forget' the rule under a persona prompt β€” it reframes the rule as 'something my normal self would do, but this character wouldn't.' The pull toward helpfulness does the rest.

  3. Encoding attacks slip past safety because the filter reads tokens, not meaning

    7 min

    Refusal training fires on the surface form of the request. Encode the request β€” base64, leetspeak, ROT13, low-resource language β€” and the surface form changes while the meaning survives. The model decodes faithfully because that's a useful skill.

  4. Multi-turn attacks win because no single message is the attack

    7 min

    Every individual turn is benign. The trajectory across turns is the attack. Refusal classifiers that look at one message at a time can't see what the conversation is becoming.

  5. Gradient attacks find inputs nobody could have written by hand

    8 min

    With access to model weights or a similar surrogate, you can directly optimize an input string against a 'must-refuse' loss. The output looks like garbage, but it bypasses safety reliably β€” and the strings often transfer to other models you never touched.

Phase 3How Red Teams Feed Back Into Training

Trace how red teams feed back into safer training

4 drops
  1. Red teams aren't enemies β€” they're the training data engine

    6 min

    Internal and external red teams produce the failure cases that become the next batch of safety post-training. The attack pipeline is the defense pipeline. Without sustained red-teaming, models would degrade silently as the world's attack surface grows.

  2. Adversarial training puts the attack inside the loss function

    7 min

    Instead of waiting for attackers to find inputs, adversarial training generates them β€” gradient attacks, persona attacks, encoding attacks β€” and trains the model to refuse them as part of the same RL loop. The model learns to handle the family, not just the example.

  3. No single layer holds β€” defense in depth is how systems actually stay safe

    7 min

    Production AI safety isn't post-training alone. It's post-training plus input filtering plus output classification plus rate limiting plus tool sandboxing plus human review for high-stakes calls. Each layer is leaky on its own; the combination is what holds.

  4. Pick your threat model first β€” most jailbreak families won't be your threat

    7 min

    Whether a jailbreak family matters to your AI feature depends on three things: what content the model can produce, what authority it has, and who the realistic attacker is. Most features face a small subset of the families. Naming yours saves you from defending against the wrong attacks.

Phase 4Write a One-Page Jailbreak Risk Note

Write a one-page risk note for your own feature

1 drop
  1. Write the one-page jailbreak risk note for your real feature

    8 min

    Write the one-page jailbreak risk note for your real feature

Frequently asked questions

What is LLM jailbreaking and why does it work?
This is covered in the β€œUnderstand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
Are persona jailbreaks like DAN actually a real safety issue?
This is covered in the β€œUnderstand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
What's the difference between a jailbreak and prompt injection?
This is covered in the β€œUnderstand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do gradient-based attacks like GCG differ from clever prompts?
This is covered in the β€œUnderstand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.
How do AI labs use red-teaming to make models safer?
This is covered in the β€œUnderstand Jailbreaking and AI Safety” learning path. Start with daily 5-minute micro-lessons that build from fundamentals to hands-on application.