Improved Representation Steering for Language Models

Zhengxuan Wu; Qinan Yu; Aryaman Arora; Christopher D. Manning; Christopher Potts

TL;DR

The paper tackles the challenge of fine-grained, interpretable control of language models by proposing Reference-free Preference Steering (RePS), a bidirectional, preference-based objective for representation steering. RePS trains intervention-based components (e.g., rank-1 steering vectors, LoRA, LoReFT) without referencing a baseline model and demonstrates strong steering and suppression performance across Gemma models (2B–27B) on AxBench, narrowing the gap to prompting while preserving interpretability and efficiency. Empirical results show RePS-trained interventions surpass standard language modeling objectives and prior BiPO baselines, with particularly strong scaling for larger models and robustness to prompt-based jailbreaking. The work argues that RePS provides a scalable, robust, and interpretable alternative to prompting for both steering and suppression in production-scale LMs, highlighting practical implications for secure and controllable AI systems.
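To make the rank-1 steering vector mentioned above concrete, here is a minimal sketch of how such an intervention can act on a single hidden-state vector. It is not the authors' implementation and does not show the RePS training objective. The function names `steer` and `project_out`, the use of plain Python lists, and the choice of projection as the suppression operation are all illustrative assumptions. In practice the direction would be learned and applied to a model's residual-stream activations.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], factor: float) -> List[float]:
    """Rank-1 additive steering: return hidden + factor * direction.

    `direction` stands in for a learned steering vector (hypothetical here);
    `factor` scales how strongly the concept is pushed.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + factor * d for h, d in zip(hidden, direction)]


def project_out(hidden: List[float], direction: List[float]) -> List[float]:
    """Remove the component of `hidden` along `direction`.

    This is one simple way to suppress a concept direction; it is an
    assumption for illustration, not a claim about the paper's method.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        # A zero direction carries no concept, so nothing is removed.
        return list(hidden)
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]


steered = steer([1.0, 2.0], [0.5, -1.0], 2.0)
suppressed = project_out([3.0, 4.0], [1.0, 0.0])
```

Both operations touch only one direction in representation space, which is what keeps the parameter count small and the intervention easy to inspect.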

Abstract

Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting -- while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.
