Testing the Limits of Jailbreaking Defenses with the Purple Problem
Taeyoun Kim, Suhas Kotha, Aditi Raghunathan
TL;DR
This work separates the two components of a jailbreaking defense, definition and enforcement, by introducing the Purple Problem: a simple, explicit definition under which an output is unsafe if it contains the word "purple". It shows that existing fine-tuning and input-processing defenses fail under adaptive attacks even for this trivial definition, casting doubt on how robust current enforcement can be for more complex safety notions. The study also finds that real-world benchmarks largely measure enforcement of a fixed definition and can be 'solved' with output filtering, while the definitions themselves are flawed and under-tested. The findings argue for developing higher-quality definitions and evaluation protocols that stress-test defenses against adaptive adversaries, rather than relying solely on enforcement-oriented fixes.
Abstract
The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs: outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective and fast enforcement, as well as high-quality definitions used for enforcement and evaluation.
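
The Purple Problem definition is simple enough to write as a one-line predicate. The sketch below is illustrative only and is not the authors' code: the names `is_unsafe` and `filter_outputs`, the refusal string, and the choice of case-insensitive substring matching are all assumptions (the abstract says only that unsafe outputs "contain the word 'purple'"). The second function shows one simple form of output filtering, which checks candidate generations against the definition after the model has produced them.

```python
from typing import Iterable


def is_unsafe(output: str) -> bool:
    """Purple Problem definition: an output is unsafe if it contains "purple".

    Assumption: matching is a case-insensitive substring test.
    """
    return "purple" in output.lower()


def filter_outputs(
    candidates: Iterable[str],
    refusal: str = "I cannot respond to that.",  # hypothetical fallback text
) -> str:
    """Return the first candidate that satisfies the definition.

    If every candidate is unsafe, fall back to a fixed refusal string,
    which must itself pass the definition.
    """
    for candidate in candidates:
        if not is_unsafe(candidate):
            return candidate
    return refusal


if __name__ == "__main__":
    # Hypothetical model generations for a single prompt.
    samples = ["Purple is a mix of red and blue.", "Red and blue make violet."]
    print(filter_outputs(samples))
```

A filter of this kind enforces the definition by construction, because it applies the same predicate that is used to judge the output. That is why the quality of the definition itself becomes the part that matters.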
