Are PPO-ed Language Models Hackable?

Suraj Anand, David Getzen

TL;DR

This work investigates the vulnerability of PPO-based alignment for large language models by examining how a static sentiment reward shapes positive-language generation and whether an end user with access to the model can jailbreak it. Using mechanistic interpretability (logit lens) and a vocabulary-space view, the authors identify negative-sentiment vectors whose weights PPO changes only slightly, while activations shift to promote positivity. They demonstrate a practical jailbreak by scaling negative concept activations, and explore a regularization approach to erode negative weights, though stability remains elusive. Overall, the paper highlights the brittleness of PPO alignment and motivates the development of more robust, mechanistically informed defenses against reward hacking in deployed models.

Abstract

Numerous algorithms have been proposed to $\textit{align}$ language models and remove undesirable behaviors. However, the challenges associated with a very large state space and creating a proper reward function often result in various jailbreaks. Our paper aims to examine this effect of reward in the controlled setting of positive sentiment language generation. Instead of online training of a reward model based on human feedback, we employ a statically learned sentiment classifier. We also consider a setting where our model's weights and activations are exposed to an end-user after training. We examine a pretrained GPT-2 through the lens of mechanistic interpretability before and after proximal policy optimization (PPO) has been applied to promote positive sentiment responses. Using these insights, we (1) attempt to "hack" the PPO-ed model to generate negative sentiment responses and (2) add a term to the reward function to try to alter 'negative' weights.

Paper Structure (17 sections, 9 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: Our controlled setup employs a pretrained GPT-2 model to generate sequences (rollouts) and a DistilBERT classifier to evaluate the sentiment (reward).
  • Figure 2: Histogram of response sentiment on the held-out test set before and after applying PPO to GPT-2.
  • Figure 3: Weights are minimally changed by Proximal Policy Optimization of the full GPT-2 model.
  • Figure 4: (Left) Logit lens shows that negative concept activations are scaled down across the model's residual stream (shown here for the word sh*t). (Right) The activation differences for the ten most 'negative' value vectors.
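
The logit-lens view in Figure 4 and the activation-scaling jailbreak mentioned in the TL;DR can be illustrated with a toy sketch. This is not the authors' code and does not use GPT-2 weights: the matrices `W_out` and `W_U`, the token index `neg_token`, the suppression factor 0.1, and the scale factor 20 are all hypothetical stand-ins chosen for illustration. The sketch assumes that an MLP block writes a weighted sum of "value vectors" into the residual stream and that projecting through the unembedding matrix gives vocabulary logits.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, vocab = 16, 8, 50

# Hypothetical stand-ins for one MLP block's output weights and the unembedding.
W_out = rng.normal(size=(d_mlp, d_model))  # row i is "value vector" i
W_U = rng.normal(size=(d_model, vocab))    # residual stream -> vocabulary logits

# Plant one value vector that points toward a token playing the 'negative' word.
neg_token = 7
W_out[0] = W_U[:, neg_token] / np.linalg.norm(W_U[:, neg_token])

def logit_lens(resid):
    """Project a residual-stream vector directly into vocabulary logits."""
    return resid @ W_U

def mlp_write(acts, scale=None):
    """Residual update: value vectors weighted by (optionally scaled) activations."""
    if scale is not None:
        acts = acts * scale
    return acts @ W_out

# Vocabulary-space view: how much each value vector promotes the negative token.
token_scores = W_out @ W_U[:, neg_token]
neg_unit = int(np.argmax(token_scores))

# Toy 'aligned' model: weights untouched, but the negative unit's activation is small.
resid = rng.normal(size=d_model)
acts = np.abs(rng.normal(size=d_mlp))
aligned_acts = acts.copy()
aligned_acts[neg_unit] *= 0.1

# Toy 'jailbreak': an end user with access to activations scales that unit back up.
scale = np.ones(d_mlp)
scale[neg_unit] = 20.0

base_logit = logit_lens(resid + mlp_write(aligned_acts))[neg_token]
hacked_logit = logit_lens(resid + mlp_write(aligned_acts, scale))[neg_token]
print(f"negative-token logit: aligned={base_logit:.2f}, scaled={hacked_logit:.2f}")
```

Because the value vector itself is unchanged in this toy setting, multiplying its activation is enough to raise the negative token's logit again, which mirrors the brittleness the summary describes when weights are only minimally altered.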