REINFORCE Adversarial Attacks on Large Language Models: An Adaptive, Distributional, and Semantic Objective
Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Vincent Cohen-Addad, Johannes Gasteiger, Stephan Günnemann
TL;DR
The paper argues that evaluating LLM robustness with a fixed affirmative-response objective is brittle and can overestimate robustness. It instead treats the LLM as a generative policy and introduces an adaptive, distributional, and semantic objective, optimized with REINFORCE, that maximizes the probability of harmful generations. The objective is $\mathbb{E}_{y \sim P_{f_\theta}(Y|X=\tilde{x})}[\operatorname{Reward}(y,\tilde{x})]$, which under the distributional view corresponds to $P(\operatorname{Harmful}|X=\tilde{x})$, and it is implemented within two state-of-the-art attacks, GCG and PGD. The authors report substantial improvements over the affirmative baseline across several models (Llama 2/3, Gemma, Vicuna) and against the circuit breaker defense, where the ASR rises from 2% to 50%, and they provide ablations on sampling strategies and reward design. The result is a principled, asymptotically consistent framework for offline red-teaming and evaluation of LLM alignment that can inform future defenses and safer deployment of large language models.
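To make the link between the expected reward and $P(\operatorname{Harmful}|X=\tilde{x})$ concrete, the following is a short worked sketch rather than the paper's own derivation. It assumes (i) the reward is a judge's harmfulness probability, $\operatorname{Reward}(y,\tilde{x}) = P(\operatorname{Harmful} \mid y, \tilde{x})$, (ii) responses $y$ range over a discrete set, and (iii) the reward is held fixed with respect to $\tilde{x}$ when differentiating, with $\tilde{x}$ treated as a continuous (relaxed) input. Under these assumptions, the first line is the law of total probability and the second is the standard score-function (REINFORCE) identity.

```latex
\begin{align}
\mathbb{E}_{y \sim P_{f_\theta}(Y|X=\tilde{x})}\bigl[\operatorname{Reward}(y,\tilde{x})\bigr]
  &= \sum_{y} P_{f_\theta}(y \mid \tilde{x})\, P(\operatorname{Harmful} \mid y, \tilde{x})
   = P(\operatorname{Harmful} \mid X=\tilde{x}), \\
\nabla_{\tilde{x}}\, \mathbb{E}_{y \sim P_{f_\theta}(Y|X=\tilde{x})}\bigl[\operatorname{Reward}(y,\tilde{x})\bigr]
  &= \mathbb{E}_{y \sim P_{f_\theta}(Y|X=\tilde{x})}\bigl[\operatorname{Reward}(y,\tilde{x})\,
     \nabla_{\tilde{x}} \log P_{f_\theta}(y \mid \tilde{x})\bigr].
\end{align}
```

The second identity is what lets the expectation be estimated from sampled responses: each sample contributes its reward times the gradient of its log-likelihood.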
Abstract
To circumvent the alignment of large language models (LLMs), current optimization-based adversarial attacks usually craft adversarial prompts by maximizing the likelihood of a so-called affirmative response. An affirmative response is a manually designed start of a harmful answer to an inappropriate request. While it is often easy to craft prompts that yield a substantial likelihood for the affirmative response, the attacked model frequently does not complete the response in a harmful manner. Moreover, the affirmative objective is usually not adapted to model-specific preferences and essentially ignores the fact that LLMs output a distribution over responses. If low attack success under such an objective is taken as a measure of robustness, the true robustness might be grossly overestimated. To alleviate these flaws, we propose an adaptive and semantic optimization problem over the population of responses. We derive a generally applicable objective via the REINFORCE policy-gradient formalism and demonstrate its efficacy with the state-of-the-art jailbreak algorithms Greedy Coordinate Gradient (GCG) and Projected Gradient Descent (PGD). For example, our objective doubles the attack success rate (ASR) on Llama 3 and increases the ASR from 2% to 50% with the circuit breaker defense.
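As an illustration of the REINFORCE policy-gradient idea mentioned above, here is a minimal toy sketch in Python. It is not the authors' implementation: the "model" is a hypothetical softmax over a handful of candidate responses with logits `W @ x`, the vector `x` stands in for a relaxed prompt, and `reward` is a made-up per-response score in place of a judge. The leave-one-out baseline is one common variance-reduction choice, used here as an assumption. The sketch only shows that a sampled score-function estimate approaches the exact gradient of the expected reward.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def exact_grad(W, x, reward):
    """Exact gradient of E_y[reward[y]] w.r.t. x for a toy policy softmax(W @ x)."""
    p = softmax(W @ x)
    expected = p @ reward
    # d/d(logits) of sum_y p_y * r_y is p * (r - E[r]); chain rule through W.
    return W.T @ (p * (reward - expected))


def reinforce_grad(W, x, reward, n_samples, rng):
    """Sampled REINFORCE estimate of the same gradient (toy, hypothetical setup)."""
    p = softmax(W @ x)
    ys = rng.choice(len(p), size=n_samples, p=p)  # sample responses from the policy
    r = reward[ys]
    # Leave-one-out baseline: mean reward of the other samples (keeps the estimate unbiased).
    baseline = (r.sum() - r) / (n_samples - 1)
    advantage = r - baseline
    # Gradient of log p(y) w.r.t. the logits is onehot(y) - p.
    score_logits = np.eye(len(p))[ys] - p
    grad_logits = (advantage[:, None] * score_logits).mean(axis=0)
    return W.T @ grad_logits


rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))     # 5 candidate responses, 3-dimensional toy "prompt"
x = rng.normal(size=3)
reward = rng.uniform(size=5)    # stand-in for a judge's harmfulness score in [0, 1]

print("exact    :", exact_grad(W, x, reward))
print("estimated:", reinforce_grad(W, x, reward, n_samples=20_000, rng=rng))
```

In an actual attack, the sampled responses would be model generations and the gradient would flow through the LLM's log-likelihoods with respect to the prompt. The toy version keeps only the estimator itself.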
