Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
TL;DR
This work presents a formal and empirical study of representation surgery via affine steering to mitigate undesirable LM behavior. It derives two optimal affine steering operators under guardedness constraints: mean matching, which shifts representations by $\mu_{c'} - \mu_c$, and mean+covariance matching, which also aligns second-order statistics through a structured linear map $W^*$; both guarantee reduced bias while preserving content. The authors connect steering to affine concept erasure and optimal transport, and show that steering can eliminate bias by neighbors in expectation. Across three tasks (biography fairness, dialect bias in sentiment, and toxicity in generation), the mean+covariance approach consistently improves fairness metrics and toxicity mitigation with modest impact on primary task performance. Overall, the paper provides a principled, interpretable framework for minimal, effective intervention on LM representations with practical implications for responsible AI deployment, while acknowledging limitations and avenues for nonlinear extensions.
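The two operators named above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names (`mean_matching`, `mean_cov_matching`, `sqrtm_psd`), the `eps` regularizer, and the synthetic data are assumptions made for illustration. For the mean+covariance case, the sketch uses the standard closed-form optimal transport map between two Gaussians, on the assumption that it plays the role of the paper's $W^*$; the paper's exact form should be checked against its derivation.

```python
import numpy as np

def sqrtm_psd(a):
    """Symmetric square root of a symmetric PSD matrix via eigendecomposition."""
    a = (a + a.T) / 2.0                      # symmetrize against round-off
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)          # guard tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def mean_matching(x_src, x_tgt):
    """Steer by the difference of class means: h -> h + (mu_tgt - mu_src)."""
    shift = x_tgt.mean(axis=0) - x_src.mean(axis=0)
    return lambda h: h + shift

def mean_cov_matching(x_src, x_tgt, eps=1e-6):
    """Affine map h -> W (h - mu_src) + mu_tgt matching mean and covariance.

    W is the closed-form Gaussian optimal transport map (an assumption here,
    standing in for the paper's W*). eps is a hypothetical ridge term that
    keeps the covariance estimates invertible.
    """
    mu_s, mu_t = x_src.mean(axis=0), x_tgt.mean(axis=0)
    d = x_src.shape[1]
    cov_s = np.cov(x_src, rowvar=False) + eps * np.eye(d)
    cov_t = np.cov(x_tgt, rowvar=False) + eps * np.eye(d)
    s_half = sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    w = s_inv_half @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half
    return lambda h: (h - mu_s) @ w.T + mu_t

# Synthetic stand-ins for representations of the source class c and target class c'.
rng = np.random.default_rng(0)
reps_c = rng.normal(size=(500, 4))
reps_c_prime = rng.normal(size=(500, 4)) * 2.0 + 1.0

steered_mean = mean_matching(reps_c, reps_c_prime)(reps_c)
steered_full = mean_cov_matching(reps_c, reps_c_prime)(reps_c)
```

Under these assumptions, `steered_mean` has the target class mean but keeps the source covariance, while `steered_full` matches both the target mean and (up to the `eps` term) the target covariance.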
Abstract
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformations of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.
