Representation Surgery: Theory and Practice of Affine Steering

Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru

TL;DR

This work presents a formal and empirical study of representation surgery via affine steering to mitigate undesirable LM behavior. It derives two optimal affine steering operators under guardedness constraints: mean matching, which shifts representations by $\mu_c' - \mu_c$, and mean+covariance matching, which also aligns second-order statistics through a structured $W^*$; both guarantee reduced bias while preserving content. The authors connect steering to affine concept erasure and optimal transport, and show that steering can eliminate bias by neighbors in expectation. Across tasks (biography fairness, dialect bias in sentiment, and toxicity in generation), the mean+covariance approach consistently improves fairness metrics and toxicity mitigation with modest impact on primary task performance. Overall, the paper provides a principled, interpretable framework for minimal, effective intervention on LM representations with practical implications for responsible AI deployment, while acknowledging limitations and avenues for nonlinear extensions.
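
As a sketch of what these two operators look like, assume source-concept statistics $\mu_c, \Sigma_c$ and target-concept statistics $\mu_c', \Sigma_c'$ (the covariance symbols are notation introduced here, not taken from the summary). Mean matching is a pure translation, while the mean+covariance map below is the standard closed-form optimal transport map between Gaussians (Knott and Smith, 1984); the paper's exact statement of $W^*$ may differ in notation:

$$
f_{\text{mean}}(h) = h + (\mu_c' - \mu_c)
$$
$$
f_{\text{mean+cov}}(h) = W^*(h - \mu_c) + \mu_c', \qquad W^* = \Sigma_c^{-1/2}\left(\Sigma_c^{1/2}\,\Sigma_c'\,\Sigma_c^{1/2}\right)^{1/2}\Sigma_c^{-1/2}
$$

Under this form, $W^* \Sigma_c W^{*\top} = \Sigma_c'$, so the steered source representations have the target's mean and covariance.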

Abstract

Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformations of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.
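
A minimal sketch of the simpler of the two steering functions, mean matching, may help make the idea concrete. It uses plain Python lists and toy two-dimensional "representations"; the function names (`class_mean`, `fit_mean_matching`, `steer`) and the data are hypothetical illustrations, not the authors' implementation.

```python
from statistics import fmean


def class_mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*vectors)]


def fit_mean_matching(source, target):
    """Fit the steering offset: target-class mean minus source-class mean."""
    mu_source = class_mean(source)
    mu_target = class_mean(target)
    return [t - s for s, t in zip(mu_source, mu_target)]


def steer(h, offset):
    """Apply the affine (here: pure translation) steering function to one vector."""
    return [x + d for x, d in zip(h, offset)]


# Toy data (assumed for illustration only).
source = [[1.0, 0.0], [3.0, 2.0]]  # e.g. representations of the undesired concept
target = [[0.0, 4.0], [2.0, 6.0]]  # e.g. representations of the desired concept

offset = fit_mean_matching(source, target)
steered = [steer(h, offset) for h in source]
```

After steering, the mean of the source representations coincides with the target mean, while differences between individual source vectors are left unchanged, which is one sense in which the intervention is minimal.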

Paper Structure

This paper contains 36 sections, 10 theorems, 62 equations, 4 figures, and 5 tables.

Key Result

Theorem 3.1

Let $\mathcal{V}$ be the family of affine predictors. Then, the following are equivalent. 1) An intervention function $f$ $(\mathcal{V}, \mathcal{L})$-affinely guards ...

Figures (4)

  • Figure 1: Left: A steering function $f(\cdot)$ is fit to map representations of a source concept (red) to a target concept (blue). Right: An illustration of an application of the fit steering function $f(\cdot)$ during autoregressive generation to mitigate toxicity.
  • Figure 2: Cosine similarity, on a log scale, between 4000 random samples in the development set (Llama2-7b model). The first 2000 rows are representations of male biographies, while the latter 2000 are representations of female biographies. The block-diagonal structure, which suggests bias by neighbors, vanishes after the application of our affine steering functions.
  • Figure 3: Percentage of top-$k$ neighbors that share gender label as a function of $k$.
  • Figure 4: $\text{TPR}_{\text{RMS}}$ versus percentage of AAE in the positive sentiment concept.

Theorems & Definitions (21)

  • Definition 3.1: Affine Guardedness
  • Theorem 3.1: Belrose et al. (2024)
  • proof
  • Theorem 3.2: LEACE; Belrose et al. (2024)
  • proof
  • Proposition 4.1
  • proof
  • Proposition 4.2
  • proof
  • Proposition 4.1: Knott and Smith (1984)
  • ...and 11 more