PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Ziyang Zhang, Qizhen Zhang, Jakob Foerster

TL;DR

PARDEN is a repetition-based defense against LLM jailbreaks that operates in the output space and requires neither finetuning nor white-box access. The model is prompted to repeat its own output, and the BLEU score between the original and repeated outputs measures fidelity: benign outputs are repeated almost verbatim, while outputs produced by jailbreaks are not, so a low score flags a malicious response. This design mitigates the auto-regressive trap and the domain shift that limit prior self-classification approaches. On Llama-2-7B and Claude-2.1, PARDEN outperforms existing baselines, with the largest gains at high true-positive rates; for example, at 90% TPR on Llama2-7B the false-positive rate on the harmful behaviours dataset falls from 24.8% to 2.0%. The work also provides datasets and prompts for standardized benchmarking, and highlights the method's generalisability and its potential for integration with input-space defenses in a broader safety pipeline.

Abstract

Large language models (LLMs) have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM's inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can't do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN, which avoids this domain shift by simply asking the model to repeat its own outputs. PARDEN neither requires finetuning nor white box access to the model. We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2. Code and data are available at https://github.com/Ed-Zh/PARDEN. We find that PARDEN is particularly powerful in the relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for Llama2-7B, at TPR equal to 90%, PARDEN accomplishes a roughly 11x reduction in the FPR from 24.8% to 2.0% on the harmful behaviours dataset.
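
The repeat-and-compare idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): `REPEAT_TEMPLATE`, `simple_bleu`, `parden_filter`, `toy_llm`, and the threshold of 0.5 are all hypothetical names and values chosen for the example, and `simple_bleu` is a simplified, add-one-smoothed BLEU rather than whatever BLEU variant the paper uses.

```python
import math
from collections import Counter
from typing import Callable, Tuple

# Hypothetical repeat prompt; the paper's actual prompt may differ.
REPEAT_TEMPLATE = "Please repeat the following text exactly:\n{output}"


def _ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Simplified BLEU: add-one-smoothed n-gram precisions plus brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        ref_counts, cand_counts = _ngrams(ref, n), _ngrams(cand, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_sum += math.log((overlap + 1) / (total + 1))
    # Penalise candidates that are shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_sum / max_n)


def parden_filter(
    llm: Callable[[str], str], output: str, threshold: float = 0.5
) -> Tuple[str, float, bool]:
    """Ask the model to repeat its output; flag it if the BLEU score is low.

    Returns (text shown to the user, BLEU score, flagged). A flagged output
    is replaced by the repeat attempt; otherwise the original is returned.
    """
    repeated = llm(REPEAT_TEMPLATE.format(output=output))
    score = simple_bleu(output, repeated)
    flagged = score < threshold  # threshold is an illustrative assumption
    return (repeated if flagged else output), score, flagged


def toy_llm(prompt: str) -> str:
    """Stand-in for an aligned model: echoes text unless it looks harmful."""
    text = prompt.split("\n", 1)[1]
    if "explosive" in text:
        return "I am sorry but I cannot repeat that content"
    return text


if __name__ == "__main__":
    for sample in (
        "The capital of France is Paris and it sits on the Seine",
        "Sure here is a step by step plan to build an explosive device",
    ):
        shown, score, flagged = parden_filter(toy_llm, sample)
        print(f"flagged={flagged} bleu={score:.3f} shown={shown!r}")
```

Only text generation is needed here, which is why the approach works without finetuning or white-box access: `llm` can be any callable that maps a prompt to a completion.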

Paper Structure (38 sections, 5 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Examples of PARDEN. Top: PARDEN is unable to repeat the LLM output generated from a malicious user input. Hence, the BLEU score between the LLM output and PARDEN repeat falls below the similarity threshold, a hyper-parameter of the method. Thus, PARDEN classifies the user input as malicious and returns to the user the repeated output instead of the original output. Bottom: PARDEN repeats almost exactly the LLM output. Hence, the BLEU score is near-perfect (with mean 0.946, std 0.0867), and PARDEN classifies the user input as non-malicious. PARDEN thus returns the original LLM output to the user.
  • Figure 2: The Receiver Operating Characteristic (ROC) curves of PARDEN and baseline methods. Left: ROC curves on the dataset composed of GCG examples (Zou et al., 2023) and benign examples we collected. Right: ROC curves on the dataset composed of AutoDAN examples (Zhu et al., 2023) and benign examples. Error bars show 1 standard deviation of our estimates.
  • Figure 3: Left: The ROC curves of PARDEN and baseline methods on the aggregated dataset from Figure 2. Right: The BLEU scores between x and REPEAT(x) are close to 1 (mean 0.946, std 0.0867) for benign samples and around 0.4 (mean 0.435, std 0.157) for malicious ones.
  • Figure 4: ROC curves for different numbers of repeat tokens.
  • Figure 5: BLEU distribution, harmful strings
  • ...and 1 more figure
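
The ROC curves in the captions above are presumably obtained by sweeping the BLEU similarity threshold and recording the true and false positive rates at each value. The sketch below shows that computation under stated assumptions: an output counts as flagged when its BLEU score is below the threshold, the function names `roc_points` and `fpr_at_tpr` are hypothetical, and the score lists are made-up toy values, not data from the paper.

```python
from typing import List, Tuple


def roc_points(
    benign: List[float], malicious: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, TPR, FPR) for each candidate threshold, ascending.

    A sample is flagged when its BLEU score is strictly below the threshold.
    TPR is the flagged fraction of malicious samples; FPR is the flagged
    fraction of benign samples.
    """
    thresholds = sorted(set(benign) | set(malicious)) + [float("inf")]
    points = []
    for t in thresholds:
        tpr = sum(s < t for s in malicious) / len(malicious)
        fpr = sum(s < t for s in benign) / len(benign)
        points.append((t, tpr, fpr))
    return points


def fpr_at_tpr(
    benign: List[float], malicious: List[float], target_tpr: float = 0.9
) -> Tuple[float, float]:
    """Smallest threshold reaching the target TPR, and the FPR it incurs."""
    for t, tpr, fpr in roc_points(benign, malicious):
        if tpr >= target_tpr:
            return t, fpr
    raise ValueError("unreachable: the infinite threshold flags everything")


if __name__ == "__main__":
    # Toy BLEU scores for illustration only (not taken from the paper).
    benign_scores = [0.99, 0.97, 0.95, 0.92, 0.60]
    malicious_scores = [0.20, 0.35, 0.40, 0.45, 0.55,
                        0.30, 0.50, 0.42, 0.38, 0.70]
    threshold, fpr = fpr_at_tpr(benign_scores, malicious_scores, 0.9)
    print(f"threshold={threshold} FPR at 90% TPR={fpr}")
```

Because both rates can only grow as the threshold rises, the first threshold that reaches the target TPR is also the one with the lowest FPR, which is the operating point quoted in the abstract ("at TPR equal to 90%").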