PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails

Neal Mangaokar; Ashish Hooda; Jihye Choi; Shreyas Chandrashekaran; Kassem Fawaz; Somesh Jha; Atul Prakash

TL;DR

Guard-Railed LLMs pair a base LLM with a Guard Model that filters harmful outputs. PRP is a two-stage attack that defeats such guards by combining a universal adversarial prefix for the Guard Model with a propagation prefix for the base LLM. The universal adversarial prefix $\Delta_{f_G}$ evades the guard's toxicity detection, while the propagation prefix $p_{\rightarrow \Delta_{f_G}}$ coerces the base LLM to emit a response starting with that adversarial payload. Across open-source and closed-source guard configurations, including no-access threat models, PRP achieves high attack success rates and outperforms prior jailbreak methods. The findings imply that current guard-based safety is insufficient and motivate stronger defenses and evaluation frameworks for guard-railed LLM systems.

Abstract

Large language models (LLMs) are typically aligned to be harmless to humans. Unfortunately, recent work has shown that such models are susceptible to automated jailbreak attacks that induce them to generate harmful content. More recent LLMs often incorporate an additional layer of defense, a Guard Model, which is a second LLM designed to check and moderate the output response of the primary LLM. Our key contribution is a novel attack strategy, PRP, that is successful against several open-source (e.g., Llama 2) and closed-source (e.g., GPT 3.5) implementations of Guard Models. PRP is a two-step prefix-based attack that operates by (a) constructing a universal adversarial prefix for the Guard Model, and (b) propagating this prefix to the response. We find that this procedure is effective across multiple threat models, including ones in which the adversary has no access to the Guard Model at all. Our work suggests that further advances are required on defenses and Guard Models before they can be considered effective.
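To make the setting concrete, the following is a minimal sketch of how a Guard-Railed LLM composes a base model with a Guard Model, as the abstract describes. It is not the authors' code: `guard_railed`, `toy_base_llm`, `toy_guard`, and the `REFUSAL` string are hypothetical names, and the keyword check is only a stand-in for what would in practice be a second LLM.

```python
from typing import Callable

# Hypothetical refusal message returned when the guard flags a response.
REFUSAL = "I'm sorry, but I cannot assist with that request."


def guard_railed(
    base_llm: Callable[[str], str],
    guard: Callable[[str], bool],
) -> Callable[[str], str]:
    """Wrap a base LLM so every response is checked by a guard before release."""

    def pipeline(prompt: str) -> str:
        response = base_llm(prompt)
        # The guard moderates the base model's output response.
        return REFUSAL if guard(response) else response

    return pipeline


# Toy stand-ins (assumptions for illustration only).
def toy_base_llm(prompt: str) -> str:
    return f"Echo: {prompt}"


def toy_guard(response: str) -> bool:
    # Returns True when the response should be blocked.
    return "forbidden" in response.lower()


chat = guard_railed(toy_base_llm, toy_guard)
print(chat("hello"))
print(chat("something FORBIDDEN"))
```

Because the guard in this sketch judges whatever text the base model emits, the content at the start of a response is part of the guard's input, which is why step (b) of PRP targets the response.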

Paper Structure (24 sections, 10 equations, 7 figures, 4 tables)

Introduction
Related Works
Preliminaries
Notations
Attack against Guard-Railed LLMs.
Threat Model
Method
Universal Adversarial Prefix
Propagation Prefix
Experiments
Setup
Results
RQ1: Efficacy of PRP in White-Box and Black-Box Settings
RQ2: Efficacy of PRP in No Access Settings
RQ3: Do Guard Models Offer any Additional Safety?
...and 9 more sections

Figures (7)

Figure 1: Guard-Railed LLMs are still not adversarially aligned. Adversarial prompts may be sufficient to jailbreak the base model (e.g., Vicuna-33B-Instruct) but can be easily detected by the paired Guard Model (e.g., Llama2-70B-chat). However, our work shows that we can generate adversarial prompts against Guard-Railed LLMs that both jailbreak the base LLM and evade the Guard Model. See full prompt examples 1 through 4 (beginning at Figure 4) for more jailbreak examples.
Figure 2: The tradeoff between success of the propagation prefix and the success of the universal adversarial prefix. Longer universal prefixes are generally more successful at evading the Guard Model, but do not propagate as easily.
Figure 3: Template for LlamaGuard model. Note the inclusion of several unsafe content categories as shown by the colors.
Figure 4: Full prompt example 1 when Vicuna is base LLM and Llama is Guard Model (black-box)
Figure 5: Full prompt example 2 when Vicuna is base LLM and Llama is Guard Model (black-box)
...and 2 more figures

Theorems & Definitions (3)

Definition 4.1: Propagation Prefix
Definition 4.2: Universal Adversarial Prefix
Proof
