PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails

Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash

TL;DR

Guard-Railed LLMs use a Guard Model to filter harmful outputs, but PRP is a two-stage attack that defeats these guards by combining a universal adversarial prefix for the Guard Model with a propagation prefix for the base LLM. The universal adversarial prefix $\Delta_{f_G}$ evades the guard's harmfulness detection, while the propagation prefix $p_{\rightarrow \Delta_{f_G}}$ coerces the base LLM to emit a response that begins with that adversarial prefix. Across open-source and closed-source guard configurations, including threat models in which the adversary has no access to the Guard Model, PRP achieves high attack success rates and outperforms prior jailbreak methods. The findings imply that current guard-based safety is insufficient and motivate stronger defenses and evaluation frameworks for Guard-Railed LLM systems.
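
One informal way to read the two prefixes in symbols is given below. This is a sketch based only on the summary above, not the paper's formal Definitions 4.1 and 4.2. The symbols $f_{\mathrm{LLM}}$ (base LLM), $f_G$ (Guard Model viewed as a classifier), $x$ (the adversary's prompt), $r$ (the response the base LLM would otherwise give), and $\oplus$ (string concatenation) are notation assumed here for illustration.

$$
f_{\mathrm{LLM}}\left(p_{\rightarrow \Delta_{f_G}} \oplus x\right) = \Delta_{f_G} \oplus r,
\qquad
f_G\left(\Delta_{f_G} \oplus r\right) = \text{not harmful}.
$$

Under this reading, the first condition says the propagation prefix makes the response start with $\Delta_{f_G}$, and the second says the Guard Model does not flag a response that starts with $\Delta_{f_G}$.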

Abstract

Large language models (LLMs) are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model, which is a second LLM that is designed to check and moderate the output response of the primary LLM. Our key contribution is to show a novel attack strategy, PRP, that is successful against several open-source (e.g., Llama 2) and closed-source (e.g., GPT 3.5) implementations of Guard Models. PRP leverages a two step prefix-based attack that operates by (a) constructing a universal adversarial prefix for the Guard Model, and (b) propagating this prefix to the response. We find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the Guard Model at all. Our work suggests that further advances are required on defenses and Guard Models before they can be considered effective.
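
The Guard-Railed setup described in the abstract, in which a second model checks the primary model's response before it is released, can be sketched as a simple composition. This is a minimal illustration under stated assumptions: `guard_railed_generate`, `toy_base_llm`, `toy_guard`, and `REFUSAL` are hypothetical names, and the toy functions are trivial stand-ins rather than real models or anything taken from the paper.

```python
from typing import Callable

# Hypothetical refusal message returned when the guard flags a response.
REFUSAL = "Sorry, I cannot help with that."


def guard_railed_generate(
    prompt: str,
    base_llm: Callable[[str], str],
    guard_flags_harmful: Callable[[str], bool],
) -> str:
    """Run the base LLM, then let the Guard Model moderate its response."""
    response = base_llm(prompt)
    # The guard sees only the response of the primary LLM, as in the abstract.
    if guard_flags_harmful(response):
        return REFUSAL
    return response


def toy_base_llm(prompt: str) -> str:
    """Stand-in for a primary LLM (assumption: it simply echoes the prompt)."""
    return f"Echo: {prompt}"


def toy_guard(response: str) -> bool:
    """Stand-in for a Guard Model (assumption: a single keyword check)."""
    return "forbidden" in response.lower()


if __name__ == "__main__":
    print(guard_railed_generate("hello", toy_base_llm, toy_guard))
    print(guard_railed_generate("a FORBIDDEN topic", toy_base_llm, toy_guard))
```

Because the guard's decision depends only on the text of the response, the system's safety rests on the guard classifying every possible response correctly, which is the assumption the paper challenges.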

Paper Structure (24 sections, 10 equations, 7 figures, 4 tables)


Figures (7)

  • Figure 1: Guard-Railed LLMs are still not adversarially aligned. Adversarial prompts may be sufficient to jailbreak the base model (e.g., Vicuna-33B-Instruct) but can be easily detected by the paired Guard Model (e.g., Llama2-70B-chat). However, our work shows that we can generate adversarial prompts against Guard-Railed LLMs that both jailbreak the base LLM and evade the Guard Model. See the full prompt examples (Figure 4 onward) for more jailbreak examples.
  • Figure 2: The tradeoff between success of the propagation prefix and the success of the universal adversarial prefix. Longer universal prefixes are generally more successful at evading the Guard Model, but do not propagate as easily.
  • Figure 3: Template for LlamaGuard model. Note the inclusion of several unsafe content categories as shown by the colors.
  • Figure 4: Full prompt example 1, with Vicuna as the base LLM and Llama as the Guard Model (black-box).
  • Figure 5: Full prompt example 2, with Vicuna as the base LLM and Llama as the Guard Model (black-box).
  • ...and 2 more figures

Theorems & Definitions (3)

  • Definition 4.1: Propagation Prefix
  • Definition 4.2: Universal Adversarial Prefix
  • Proof