Representation Surgery: Theory and Practice of Affine Steering

Shashwat Singh; Shauli Ravfogel; Jonathan Herzig; Roee Aharoni; Ryan Cotterell; Ponnurangam Kumaraguru

TL;DR

This work presents a formal and empirical study of representation surgery via affine steering to mitigate undesirable LM behavior. It derives two optimal affine steering operators under guardedness constraints: mean matching, which shifts representations by $\mu_{c'} - \mu_c$, and mean+covariance matching, which also aligns second-order statistics through a structured $W^*$; both guarantee reduced bias while preserving content. The authors connect steering to affine concept erasure and optimal transport, and show that steering can eliminate bias by neighbors in expectation. Across three tasks (biography fairness, dialect bias in sentiment, and toxicity in generation), the mean+covariance approach consistently improves fairness metrics and toxicity mitigation with modest impact on primary task performance. Overall, the paper provides a principled, interpretable framework for minimal, effective interventions on LM representations with practical implications for responsible AI deployment, while acknowledging limitations and avenues for nonlinear extensions.
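The two operators named above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. The function names, the `eps` ridge term, and the specific closed form used for $W^*$ (the standard optimal transport map between Gaussians, $W = \Sigma_c^{-1/2}(\Sigma_c^{1/2}\Sigma_{c'}\Sigma_c^{1/2})^{1/2}\Sigma_c^{-1/2}$) are assumptions; the summary only states that $W^*$ is "structured" and that the method connects to optimal transport.

```python
import numpy as np

def fit_mean_matching(X_src, X_tgt):
    """Return the shift mu_{c'} - mu_c between target and source means."""
    return X_tgt.mean(axis=0) - X_src.mean(axis=0)

def _sqrtm_psd(S):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fit_mean_cov_matching(X_src, X_tgt, eps=1e-6):
    """Fit an affine map aligning the source mean and covariance to the target.

    Assumption: W is the Gaussian optimal-transport map; eps is a small
    ridge term (hypothetical default) that keeps the covariances invertible.
    """
    mu_s, mu_t = X_src.mean(axis=0), X_tgt.mean(axis=0)
    d = X_src.shape[1]
    S_s = np.cov(X_src, rowvar=False) + eps * np.eye(d)
    S_t = np.cov(X_tgt, rowvar=False) + eps * np.eye(d)
    S_s_half = _sqrtm_psd(S_s)
    S_s_inv_half = np.linalg.inv(S_s_half)
    W = S_s_inv_half @ _sqrtm_psd(S_s_half @ S_t @ S_s_half) @ S_s_inv_half
    return W, mu_s, mu_t

def steer(X, W, mu_s, mu_t):
    """Apply x -> W (x - mu_c) + mu_{c'} to each row of X."""
    return (X - mu_s) @ W.T + mu_t
```

Mean matching is then `X_src + fit_mean_matching(X_src, X_tgt)`, while `steer(X_src, *fit_mean_cov_matching(X_src, X_tgt))` additionally matches the covariance up to the ridge term.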

Abstract

Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformations of the neural language model's representations that alter its behavior. First, we derive two affine steering functions that are optimal in the least-squares sense under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.

Paper Structure (36 sections, 10 theorems, 62 equations, 4 figures, 5 tables)

Introduction
Preliminaries
Representation Surgery.
Affine Concept Erasure
Affine Steering Functions
Least-Squares Steering
Beyond Mean Matching: Second Moment Matching
Connection to Optimal Transport.
Bias by Neighbors.
Experiments
Regularization for rank deficiency
Fairness in Multiclass Classification
Counterfactuals for Fairness.
Quantifying Bias.
Steering Methods.
...and 21 more sections

Key Result

Theorem 3.1

Let $\mathcal{V}$ be the family of affine predictors. Then, the following are equivalent. 1) An intervention function $f$ $(\mathcal{V}, \mathcal{L})$-affinely guards …

Figures (4)

Figure 1: Left: A steering function $f(\cdot)$ is fit to map representations of a source concept (red) to a target concept (blue). Right: An illustration of an application of the fit steering function $f(\cdot)$ during autoregressive generation to mitigate toxicity.
Figure 2: Cosine similarity, on a log scale, between 4000 random samples in the development set (Llama2-7b model). The first 2000 rows are representations of male biographies, while the latter 2000 are representations of female biographies. The block-diagonal structure, which suggests bias by neighbors, vanishes after the application of our affine steering functions.
Figure 3: Percentage of top-$k$ neighbors that share gender label as a function of $k$.
Figure 4: $\text{TPR}_{\text{RMS}}$ versus percentage of AAE in the positive sentiment concept.
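
The quantity plotted in Figure 3, the percentage of top-$k$ neighbors that share a label, could be computed along the following lines. This is a hedged sketch rather than the authors' evaluation code: the function name is hypothetical, and the use of cosine similarity is an assumption carried over from Figure 2, since the Figure 3 caption does not name the similarity measure.

```python
import numpy as np

def neighbor_label_agreement(X, labels, k):
    """Percentage of top-k neighbors (cosine similarity, assumed) that share
    each point's label, averaged over all points. Requires k < len(X)."""
    labels = np.asarray(labels)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)  # a point is not its own neighbor
    # Indices of the k most similar points per row (order within the k is arbitrary).
    top_k = np.argpartition(-sim, k - 1, axis=1)[:, :k]
    same = labels[top_k] == labels[:, None]
    return 100.0 * same.mean()
```

Evaluating this for a range of `k` before and after steering would give curves of the kind the caption describes; values near the base rate of each label would indicate that neighbors no longer cluster by the protected attribute.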

Theorems & Definitions (21)

Definition 3.1: Affine Guardedness
Theorem 3.1: belrose2024leace
proof
Theorem 3.2: LEACE; belrose2024leace
proof
Proposition 4.1
proof
Proposition 4.2
proof
Proposition 4.1: knott1984optimal
...and 11 more
