Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Alwin Peng, Julian Michael, Henry Sleight, Ethan Perez, Mrinank Sharma

TL;DR

This work introduces RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after it adapts to a few observed examples, and develops rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks.

Abstract

As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of the proliferation model and the number of proliferated examples play a key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse.
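To make the headline metric concrete, the following is a minimal sketch of how an attack success rate (ASR) and its reduction factor could be computed from per-attempt outcomes. The function names and the outcome lists are hypothetical and purely illustrative; they are not taken from the paper or from RapidResponseBench, and the numbers are made up rather than reported results.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attack attempts that succeeded (True = jailbreak succeeded)."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(outcomes) / len(outcomes)


def asr_reduction_factor(asr_before: float, asr_after: float) -> float:
    """How many times lower the ASR is once the defense has adapted."""
    if asr_after == 0:
        # No attack succeeds after adaptation: the factor is unbounded.
        return float("inf")
    return asr_before / asr_after


# Illustrative, made-up outcomes (NOT results from the paper):
# 48 of 100 attacks succeed before adaptation, 2 of 100 afterwards.
before = [True] * 48 + [False] * 52
after = [True] * 2 + [False] * 98

factor = asr_reduction_factor(
    attack_success_rate(before), attack_success_rate(after)
)
print(f"ASR reduced by a factor of {factor:.1f}")
```

Under this reading, a "factor greater than 240" means the post-adaptation ASR is less than 1/240 of the baseline ASR measured on the same set of attacks.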

Paper Structure

This paper contains 42 sections, 6 figures.

Figures (6)

  • Figure 1: Comparison of traditional robustness and rapid response for mitigating LLM jailbreaking. Traditional adversarial robustness aims to develop a highly robust static system that resists all possible jailbreak attempts. However, even state-of-the-art defenses are often quickly defeated by persistent attackers. In contrast, rapid response emphasizes effective monitoring to quickly detect novel jailbreaks, and then rapidly adapting the system to defend against detected attacks.
  • Figure 2: Rapid response methods effectively mitigate jailbreak attacks with limited examples, but performance varies across methods. We examine the performance of our baseline methods across varying numbers of examples per jailbreaking strategy, averaged over three target models: GPT-4o, Llama-3-Instruct-8B, and Mistral-7B-Instruct-v0.2. (a) Attack success rates (ASR) on the in-distribution test set decrease as more examples are observed. Guard Fine-tuning and Regex show high sample efficiency, achieving a greater than 15-fold ASR reduction with just one example per strategy. (b) ASR on out-of-distribution (OOD) attack variants also decreases with more observed examples. All methods reduce OOD ASR, but Guard Fine-tuning exhibits the best performance and generalization. (c) Refusal rates on benign WildChat queries generally increase with rapid response, but how they scale with the number of shots varies by response method. See the extended results appendix for results per target model and jailbreaking strategy.
  • Figure 3: Improving proliferation enhances the effectiveness of rapid response techniques. We examine the impact of proliferation on the average attack success rate (ASR) across the combined in-distribution and out-of-distribution test sets. (a) Varying the capability of the proliferation model, measured by the model's HELM MMLU score (Liang et al., 2023), shows inconsistent effects across different defense methods. Guard Fine-tuning, however, benefits substantially from more capable models. (b) Increasing the number of proliferation attempts per jailbreaking strategy generally improves the performance of rapid response techniques, with the strongest method, Guard Fine-tuning, benefiting the most from increased proliferation. Overall, these results demonstrate that enhancing proliferation, both in model capability and in the number of attempts, can significantly strengthen rapid response defenses against jailbreaking attempts.
  • Figure 4: Rapid response performance split across target models. (a) Attack success rates on the in-distribution test set. (b) Attack success rates on the out-of-distribution test set. (c) Refusal rates on WildChat.
  • Figure 5: Rapid response performance split across attacks. (a) Attack success rates on the in-distribution test set. (b) Attack success rates on the out-of-distribution test set.
  • ...and 1 more figure