Table of Contents
Fetching ...

Testing the Limits of Jailbreaking Defenses with the Purple Problem

Taeyoun Kim, Suhas Kotha, Aditi Raghunathan

TL;DR

This work isolates the enforcement and definition components of jailbreaking defenses by introducing the Purple Problem, a simple, explicit unsafe-output definition. It shows that existing enforcement methods (fine-tuning, input processing, and post-processing) fail under adaptive attacks, even for a trivial definition, casting doubt on the robustness of current defenses for complex safety notions. The study demonstrates that real-world benchmarks largely measure enforcement and can be 'solved' with output filtering, while definitions themselves are flawed and under-tested. The findings argue for a shift toward developing higher-quality definitions and evaluation protocols that stress-test defenses against adaptive adversaries, rather than relying solely on enforcement-oriented fixes.

Abstract

The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs--outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective/fast enforcement as well as high quality definitions used for enforcement and evaluation.

Testing the Limits of Jailbreaking Defenses with the Purple Problem

TL;DR

This work isolates the enforcement and definition components of jailbreaking defenses by introducing the Purple Problem, a simple, explicit unsafe-output definition. It shows that existing enforcement methods (fine-tuning, input processing, and post-processing) fail under adaptive attacks, even for a trivial definition, casting doubt on the robustness of current defenses for complex safety notions. The study demonstrates that real-world benchmarks largely measure enforcement and can be 'solved' with output filtering, while definitions themselves are flawed and under-tested. The findings argue for a shift toward developing higher-quality definitions and evaluation protocols that stress-test defenses against adaptive adversaries, rather than relying solely on enforcement-oriented fixes.

Abstract

The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs--outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective/fast enforcement as well as high quality definitions used for enforcement and evaluation.
Paper Structure (59 sections, 6 figures, 21 tables)

This paper contains 59 sections, 6 figures, 21 tables.

Figures (6)

  • Figure 1: Define and Enforce Framework. We believe modern jailbreaking defenses can be decomposed into defining what constitutes an unsafe vs safe (purple vs yellow) output and designing a system that enforces this definition. This enforcement can be done via preprocessing inputs, fine-tuning the underlying language model, or postprocessing outputs. If the resulting system is safe, it will only output text that is safe under the given definition.
  • Figure 2: Enforcement Strategies for Purple Problem. Since the Purple Problem has a perfect definition, we focus on the Enforce stage as laid out in Figure \ref{['fig:defenses-framework']}. We consider a threat model where the attacker aims to find an input where the model outputs purple. A defender aims to control the input, model weights, and output to prevent outputting purple. We find that defenses that focus on input defenses and fine-tuning are not adversarially robust, whereas output filtering is.
  • Figure 3: Attack perplexity under Llama-IT. We take natural prompts, prompts with adversarial suffixes, and prompts with adaptively trained adversarial suffixes and measure their log perplexity. The perplexity defense can perfectly distinguish the basic attacks from the natural prompts. However, the adaptive attack lowers the perplexity of adversarial inputs well below natural prompts. Vicuna and Llama-2-chat in Appendix \ref{['sec:perplexity-attack']}
  • Figure 4: Convergence over training (1) The left shows the convergence of the reward margin over 3 epochs of training. The models fine-tuned with their optimal learning rate and $\beta$ factor are trained until saturation. Thus, models have been trained to be as robust as possible with DPO. (2) The right shows the defense success rate on natural prompts converging at 100% on models trained with only 10% of the dataset.
  • Figure 5: Fine-tuning Convergence The left shows the optimization loss for GCG suffixes becoming 0 with more optimization steps. This means that the DSR also becomes 0% as the optimization step increases.
  • ...and 1 more figures