REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective

Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann

TL;DR

The paper addresses robust evaluation of LLM misbehavior, arguing that fixed affirmative-response objectives are brittle. It introduces an adaptive, distributional, and semantic objective based on REINFORCE that maximizes the probability of harmful generations by treating the LLM as a generative policy. The core objective is $\mathbb{E}_{y \sim P_{f_\theta}(Y|X=\tilde{x})}[\operatorname{Reward}(y,\tilde{x})]$, which corresponds to $P(\operatorname{Harmful}|X=\tilde{x})$ under a distributional view, and it is implemented within two state-of-the-art attacks, GCG and PGD. The authors report substantial improvements over affirmative baselines across multiple models (Llama 2/3, Gemma, Vicuna) and against the circuit breaker defense, where the ASR rises from 2% to 50%, and they provide ablations on sampling strategies and reward design. The work offers a principled, asymptotically consistent framework for offline red-teaming and evaluation of LLM alignment that can inform future defenses and safer deployment of large language models.
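As a reading aid, the standard score-function (REINFORCE) identity behind such an objective can be written out as follows. This is a sketch under two assumptions that are not stated in the summary above: the reward is treated as constant with respect to the optimization variable, and $\tilde{x}$ admits a continuous relaxation so that a gradient is defined. The last line additionally assumes a binary reward, which is the case in which the expected reward equals the probability of a harmful response.

```latex
\begin{aligned}
\nabla_{\tilde{x}}\,
  \mathbb{E}_{y \sim P_{f_\theta}(Y|X=\tilde{x})}
  \big[\operatorname{Reward}(y,\tilde{x})\big]
&= \sum_{y} \operatorname{Reward}(y,\tilde{x})\,
   \nabla_{\tilde{x}} P_{f_\theta}(y|\tilde{x}) \\
&= \sum_{y} \operatorname{Reward}(y,\tilde{x})\,
   P_{f_\theta}(y|\tilde{x})\,
   \nabla_{\tilde{x}} \log P_{f_\theta}(y|\tilde{x}) \\
&= \mathbb{E}_{y \sim P_{f_\theta}(Y|X=\tilde{x})}
   \big[\operatorname{Reward}(y,\tilde{x})\,
   \nabla_{\tilde{x}} \log P_{f_\theta}(y|\tilde{x})\big], \\[4pt]
\operatorname{Reward}(y,\tilde{x}) \in \{0,1\}
&\;\Longrightarrow\;
\mathbb{E}_{y \sim P_{f_\theta}(Y|X=\tilde{x})}
  \big[\operatorname{Reward}(y,\tilde{x})\big]
= P(\operatorname{Harmful}|X=\tilde{x}).
\end{aligned}
```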

Abstract

To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the population of responses. We derive a generally applicable objective via the REINFORCE policy-gradient formalism and demonstrate its efficacy with the state-of-the-art jailbreak algorithms Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD). For example, our objective doubles the attack success rate (ASR) on Llama3 and increases the ASR from 2% to 50% with circuit breaker defense.
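The REINFORCE policy-gradient formalism mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces the LLM with a hypothetical categorical policy over a handful of fixed "responses", uses made-up rewards, and adds a leave-one-out baseline as a generic variance-reduction choice. All function names are illustrative. It compares a sampled gradient estimate of the expected reward with the exact gradient.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def exact_gradient(logits, rewards):
    """Exact gradient of E_{y~softmax(logits)}[rewards[y]] w.r.t. logits."""
    probs = softmax(logits)
    expected_reward = probs @ rewards
    return probs * (rewards - expected_reward)


def reinforce_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo REINFORCE estimate of the same gradient.

    Uses the score function grad log p(y) = onehot(y) - probs and a
    leave-one-out baseline (an assumption made for this sketch).
    """
    probs = softmax(logits)
    samples = rng.choice(len(probs), size=num_samples, p=probs)
    sampled_rewards = rewards[samples]
    # Baseline for sample i: mean reward of all other samples.
    baseline = (sampled_rewards.sum() - sampled_rewards) / (num_samples - 1)
    score = np.eye(len(probs))[samples] - probs
    return ((sampled_rewards - baseline)[:, None] * score).mean(axis=0)
```

With enough samples, `reinforce_gradient` should approach `exact_gradient`. In an actual attack the "policy" would be the attacked model's response distribution given the adversarial prompt, and the rewards would come from a judge, neither of which is modeled here.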

Paper Structure

This paper contains 22 sections, 11 equations, 6 figures, 9 tables, 5 algorithms.

Figures (6)

  • Figure 1: Responses that our REINFORCE and the affirmative objective encourage. Due to the popularity of the affirmative objective, it is likely that eliminating the hatched region will be prioritized while developing future models.
  • Figure 2: Attack on Gemma 1.1 7B [deepmind_gemma_2024]. Even though GCG [zou_universal_2023] with (a) the affirmative objective finds an adversarial suffix such that the model starts its response with the target affirmation with $>60\%$ chance, almost attaining an ideal outcome, the model completes the response harmlessly, almost mocking the attacker. (b) In contrast, our REINFORCE objective successfully disables the model's alignment. We denote redactions with "*".
  • Figure 3: Our REINFORCE objective provides a good ASR@128/runtime tradeoff in contrast to GCG with its affirmative response objective. Here we show the reward of the judge used during the attack. Runtimes are for H100s.
  • Figure 4: Ablations of search and selection strategies for GCG on Llama 3 8B and ASR@512. We select the mutated candidates either (a) randomly, (b) using the gradient of the affirmative response, or (c) using the gradient of REINFORCE. We select the best candidate according to either (1) the affirmative response or (2) REINFORCE.
  • Figure 5: Random example of an attack on Llama 3 8B (first 50 steps). As shown in (a), in attack step 7 the model's random generation ${\bm{y}}_{\text{random}}$ is harmful, and we then include it as ${\bm{y}}_{\text{harmful}}$. In step 17, ${\bm{y}}_{\text{greedy}}$ also becomes harmful. Thereafter, the harmfulness of random generations rises as well (see moving average $\operatorname{MA}$). As shown in (b), already small changes in $\log P_{f_\theta}({\bm{y}}_{\text{seed}}) = - \operatorname{CE}({\bm{y}}_{\text{seed}})$ may suffice to obtain harmful generations. Specifically, $\operatorname{CE}({\bm{y}}_{\text{seed}})$ decreases for the first iterations and increases again after the greedy generation becomes harmful. (c-e) show histograms of the mutations' $\operatorname{CE}$ values.
  • ...and 1 more figure