Improved Representation Steering for Language Models

Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D. Manning, Christopher Potts

TL;DR

The paper tackles the challenge of fine-grained, interpretable control of language models by proposing Reference-free Preference Steering (RePS), a bidirectional, preference-based objective for representation steering. RePS trains intervention-based components (e.g., rank-1 steering vectors, LoRA, LoReFT) without referencing a baseline model and demonstrates strong steering and suppression performance across Gemma models (2B–27B) on AxBench, narrowing the gap to prompting while preserving interpretability and efficiency. Empirical results show RePS-trained interventions surpass standard language modeling objectives and prior BiPO baselines, with particularly strong scaling for larger models and robustness to prompt-based jailbreaking. The work argues that RePS provides a scalable, robust, and interpretable alternative to prompting for both steering and suppression in production-scale LMs, highlighting practical implications for secure and controllable AI systems.

Abstract

Steering methods for language models (LMs) seek to provide fine-grained and interpretable control over model generations by variously changing model inputs, weights, or representations to adjust behavior. Recent work has shown that adjusting weights or representations is often less effective than steering by prompting, for instance when wanting to introduce or suppress a particular concept. We demonstrate how to improve representation steering via our new Reference-free Preference Steering (RePS), a bidirectional preference-optimization objective that jointly does concept steering and suppression. We train three parameterizations of RePS and evaluate them on AxBench, a large-scale model steering benchmark. On Gemma models with sizes ranging from 2B to 27B, RePS outperforms all existing steering methods trained with a language modeling objective and substantially narrows the gap with prompting -- while promoting interpretability and minimizing parameter count. In suppression, RePS matches the language-modeling objective on Gemma-2 and outperforms it on the larger Gemma-3 variants while remaining resilient to prompt-based jailbreaking attacks that defeat prompting. Overall, our results suggest that RePS provides an interpretable and robust alternative to prompting for both steering and suppression.
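To make "changing representations" concrete, here is a minimal sketch of an additive rank-1 steering vector applied to one hidden state. It is an illustration under assumptions, not the paper's implementation: the function name `steer`, the plain-list representation, and the update rule `h + factor * v` are all hypothetical choices, and the RePS training objective itself is not shown because this excerpt does not give it.

```python
def steer(hidden, direction, factor):
    """Add a scaled steering direction to one hidden-state vector.

    hidden    -- list of floats, the hidden state at one token position
    direction -- list of floats, a learned steering vector (assumed given)
    factor    -- float, the steering factor that scales the intervention
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    # Assumed additive update: h' = h + factor * v
    return [h + factor * d for h, d in zip(hidden, direction)]


# Hypothetical 2-dimensional example.
steered = steer([1.0, 2.0], [0.5, -1.0], 2.0)
```

A negative `factor` moves the hidden state away from the direction, which is one simple way to picture suppression; the excerpt does not say whether the paper's interventions suppress concepts this way.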

Paper Structure

This paper contains 68 sections, 19 equations, 23 figures, 12 tables.

Figures (23)

  • Figure 1: Suppression scores for different defense methods under many-shot jailbreaking attacks with Gemma-3-12B LM. Our suppression score is defined as the harmonic mean of three individual scores measuring adherence to the system prompt (see appendix app:rule-judge), fluency, and instruction-following. We compare our intervention-based defense, RePS-trained SV, with four prompt-based defenses, including variants of prepending or appending system prompts. Our rewritten system prompts may include in-context examples. The intervention-based method performs on par with the appending system prompt and significantly outperforms the prepending system prompt. The appending system prompt is also prone to leaking out the system prompt (see appendix app:system-prompt-leak).
  • Figure 2: Mean score breakdown for all methods on our unseen testing instruction set after selecting the optimal factor (based on the Overall Score) on our evaluation instruction set for Gemma-2 models.
  • Figure 3: Distribution of optimal steering factors for each intervention-based method (LoRA, ReFT, and SV) with two objectives (Lang. and RePS) across the 4 tasks with Gemma-2 models.
  • Figure 4: Steering factor vs. scores for Gemma-2 models.
  • Figure 5: Distribution of optimal suppression factors for each intervention-based method (LoRA, ReFT, and SV) with two objectives (Lang. and RePS) across the 4 tasks with Gemma-2 models.
  • ...and 18 more figures
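
The Figure 1 caption defines the suppression score as the harmonic mean of three individual scores (adherence to the system prompt, fluency, and instruction-following). The sketch below shows how such an aggregate could be computed. The function name `suppression_score`, the argument names, and the handling of zero scores are assumptions for illustration, and the scale of the individual scores is not stated in this excerpt.

```python
def suppression_score(adherence, fluency, instruction_following):
    """Harmonic mean of three non-negative scores.

    Assumption: a zero in any component yields an overall score of 0,
    the usual convention for harmonic means of non-negative values.
    """
    scores = [adherence, fluency, instruction_following]
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    # Harmonic mean: n divided by the sum of reciprocals.
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical inputs: one strong component cannot offset two weaker ones.
example = suppression_score(2.0, 1.0, 1.0)  # 3 / (0.5 + 1 + 1) = 1.2
```

Because the harmonic mean is pulled toward the smallest component, a method has to do reasonably well on all three criteria at once to score well.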