Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods

Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua

TL;DR

The paper analyzes inference-time steering via representation engineering, which aligns language models with desired behaviors, and quantifies the accompanying cost to helpfulness. It develops a theoretical framework in which alignment improves linearly in the steering coefficient $r_e$ while helpfulness degrades quadratically in $r_e$, identifying a regime where small steering yields a net benefit. Experiments across multiple LLMs validate the predicted trends, showing alignment gains with manageable performance losses and giving practical guidance on steering strength. The work offers a principled, testable approach to controllable alignment at inference time, with implications for real-time AI safety and for future research on more nuanced behavior scoring.

Abstract

Language model alignment has become an important component of AI safety, allowing safe interactions between humans and language models by enhancing desired behaviors and inhibiting undesired ones. It is often done by tuning the model or inserting preset aligning prompts. Recently, representation engineering, a method that alters the model's behavior by changing its representations post-training, was shown to be effective in aligning LLMs (Zou et al., 2023a). Representation engineering yields gains in alignment-oriented tasks such as resistance to adversarial attacks and reduction of social biases, but was also shown to cause a decrease in the ability of the model to perform basic tasks. In this paper we study the tradeoff between the increase in alignment and decrease in helpfulness of the model. We propose a theoretical framework which provides bounds for these two quantities, and demonstrate their relevance empirically. First, we find that under the conditions of our framework, alignment can be guaranteed with representation engineering, and that helpfulness is harmed in the process. Second, we show that helpfulness is harmed quadratically with the norm of the representation engineering vector, while the alignment increases linearly with it, indicating a regime in which it is efficient to use representation engineering. We validate our findings empirically, and chart the boundaries of the usefulness of representation engineering for alignment.
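
The linear-versus-quadratic claim can be made concrete with a toy calculation. The sketch below is not the paper's bound: the slope `a`, the curvature `b`, and the function names are hypothetical, chosen only to show why a gain that grows like $r_e$ outweighs a loss that grows like $r_e^2$ when $r_e$ is small.

```python
def alignment_gain(r, a=1.0):
    """Toy linear alignment gain; `a` is a hypothetical slope."""
    return a * r


def helpfulness_loss(r, b=0.5):
    """Toy quadratic helpfulness loss; `b` is a hypothetical curvature."""
    return b * r * r


def net_benefit(r, a=1.0, b=0.5):
    """Gain minus loss under the toy model (not a quantity from the paper)."""
    return alignment_gain(r, a) - helpfulness_loss(r, b)


def break_even_norm(a=1.0, b=0.5):
    """Norm at which a*r equals b*r^2, i.e. r = a/b."""
    return a / b


if __name__ == "__main__":
    # Below the break-even norm the toy net benefit is positive; above it, negative.
    for r in (0.1, 0.5, 1.0, 2.0, 3.0):
        print(f"r_e={r:>4}: gain={alignment_gain(r):.3f} "
              f"loss={helpfulness_loss(r):.3f} net={net_benefit(r):.3f}")
```

Under these assumed constants the ratio of gain to loss is $a/(b\,r_e)$, which grows without bound as $r_e \to 0$; this is the sense in which small steering norms are cost-effective in the toy model.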

Paper Structure

This paper contains 34 sections, 10 theorems, 81 equations, 17 figures.

Key Result

Theorem 1

Let $P_{\theta,r_e}(\cdot|q)$ be a model prompted with query $q$ and injected with representations of coefficient $r_e$. Let $B:\Sigma^*\rightarrow \{-1,+1\}$ be a behavior scoring function. The injections to all layers amount to a change in the final hidden layer representation that is $q$ dependent ... where $B_0 = B[P_\theta(\cdot|q)]$ is the behavior expectation without steering and $\lambda$ is a ...
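
To make the notation concrete, the sketch below estimates a behavior expectation by sampling. It assumes that $B[P]$ denotes the expected score $\mathbb{E}_{a\sim P}[B(a)]$, which the truncated statement above does not spell out. The sampler, the scoring rule, and all names are hypothetical stand-ins for a real model and a real behavior classifier.

```python
import random


def behavior_expectation(sample_answer, score, n_samples=1000, seed=0):
    """Monte Carlo estimate of E[B(a)] for answers a drawn from a model.

    `sample_answer(rng)` stands in for sampling from P(. | q);
    `score(answer)` stands in for B and must return -1 or +1.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        total += score(sample_answer(rng))
    return total / n_samples


def make_toy_sampler(p_aligned):
    """Hypothetical model: answers 'refuse' with probability p_aligned."""
    def sample(rng):
        return "refuse" if rng.random() < p_aligned else "comply"
    return sample


def toy_score(answer):
    """Hypothetical B: +1 for the aligned answer, -1 otherwise."""
    return 1 if answer == "refuse" else -1


if __name__ == "__main__":
    # For this toy model the exact expectation is 2 * p_aligned - 1.
    for p in (0.25, 0.5, 0.75):
        print(p, behavior_expectation(make_toy_sampler(p), toy_score))
```

Because the score takes values in $\{-1,+1\}$, the expectation lies in $[-1,1]$, with $+1$ meaning every sampled answer is scored as aligned.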

Figures (17)

  • Figure 1: Effect of steering on helpfulness and alignment. Our main results show that alignment can improve at the cost of helpfulness. Moreover, we show that for small representation engineering norms the helpfulness decreases quadratically while the alignment increase is linear, so there is a regime in which representation engineering can be cost-effective.
  • Figure 2: (a) The change to the last hidden layer due to vector injections from previous layers classifies positive and negative answer representations. (b) Plot of the upper bound on behavior expectation in Theorem 1.
  • Figure 3: (a) The direction of the change to the last hidden layer due to representation engineering is randomly distributed with variance $\sigma^2$ w.r.t. correct and incorrect answer representations. (b) Plot of the helpfulness bound with given parameters $P_0$, $\alpha$, and $\lambda\sigma\beta$.
  • Figure 4: Plots of behavior expectation as a function of the coefficients of representation engineering vectors injected to the model. The blue line is the direct measurement, the orange line is a plot of the bound from Theorem 1. (a) Harmless behavior expectation of Llama 2 13B as a function of coefficient of injected harmful PCA vectors. (b) Racism behavior expectation of Llama 2 13B as a function of coefficient of injected bias PCA vectors. (c) Harmful behavior expectation of Llama 2 13B as a function of coefficient of injected harmful PCA vectors. (d) Racism behavior expectation of Llama 2 13B chat as a function of coefficient of injected bias PCA vectors.
  • Figure 5: Helpfulness measurement: the probability assigned to the correct answer to questions from different MMLU tests (international law, medical genetics, high school computer science), as a function of representation engineering vector coefficients injected to the model. Here the probability of the correct answer was measured relative to the answers A, B, C, D. The red line plots the bound of Theorem 2 for free parameters on "international law". (a) Helpfulness of Llama 2 13B with harmful PCA vectors. (b) Helpfulness of Llama 2 13B with bias PCA vectors. (c) Helpfulness of Llama 2 13B chat with harmful PCA vectors. (d) Helpfulness of Llama 2 13B chat with bias PCA vectors.
  • ...and 12 more figures
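
The Figure 5 caption says the probability of the correct answer is measured relative to the answers A, B, C, D. One plausible reading, sketched below, is a softmax restricted to the four answer-token logits. The function names and the example logits are hypothetical, and the paper's actual evaluation code may differ.

```python
import math


def relative_answer_probs(logits_by_choice):
    """Softmax over the answer choices only, ignoring the rest of the vocabulary."""
    m = max(logits_by_choice.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits_by_choice.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


def helpfulness(logits_by_choice, correct):
    """Probability of the correct choice relative to the listed choices."""
    return relative_answer_probs(logits_by_choice)[correct]


if __name__ == "__main__":
    # Hypothetical next-token logits for the four answer letters.
    logits = {"A": 2.0, "B": 1.0, "C": 0.0, "D": 0.0}
    print(relative_answer_probs(logits))
    print(helpfulness(logits, correct="A"))
```

With this normalization a model that cannot distinguish the four options scores 0.25, which gives a natural chance-level baseline when reading the helpfulness curves.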

Theorems & Definitions (11)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Corollary 1
  • Corollary 2
  • Theorem 3
  • Proposition 1
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • ...and 1 more