Effectively Steer LLM To Follow Preference via Building Confident Directions

Bingqing Song, Boran Han, Shuai Zhang, Hao Wang, Haoyang Fang, Bonan Min, Yuyang Wang, Mingyi Hong

TL;DR

The paper addresses inference-time alignment of LLM outputs to user preferences by formulating a Bayesian framework for model steering and introducing Confident Direction Steering (CONFST). CONFST identifies a confident steering direction from user history using activations and a lightweight classifier, enabling alignment to more than two directions without explicit prompts or a search over layers. The approach comes with theoretical guarantees on steering quality and shows empirical improvements over mean steering across topic and style shifts on GPT-2 XL, Mistral, and Gemma models. The result is practical, instruction-free personalization of LLM outputs, with principled steering directions that can be added together to steer several features at once.

Abstract

Having an LLM that aligns with human preferences is essential for accommodating individual needs, such as maintaining writing style or generating specific topics of interest. The majority of current alignment methods rely on fine-tuning or prompting, which can be either costly or difficult to control. Model steering algorithms, which modify the model output by constructing specific steering directions, are typically easy to implement and optimization-free. However, their capabilities are typically limited to steering the model into one of the two directions (i.e., bidirectional steering), and there has been no theoretical understanding to guarantee their performance. In this work, we propose a theoretical framework to understand and quantify the model steering methods. Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations at inference time. More specifically, CONFST builds a confident direction that is closely aligned with users' preferences, and this direction is then added to the activations of the LLMs to effectively steer the model output. Our approach offers three key advantages over popular bidirectional model steering methods: 1) It is more powerful, since multiple (i.e. more than two) users' preferences can be aligned simultaneously; 2) It is simple to implement, since there is no need to determine which layer to add the steering vector to; 3) No explicit user instruction is required. We validate our method on GPT-2 XL (1.5B), Mistral (7B) and Gemma-it (9B) models for tasks that require shifting the output of LLMs across various topics and styles, achieving superior performance over competing methods.
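The confident-direction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the activations here are synthetic Gaussian vectors rather than real LLM hidden states, the "lightweight classifier" is assumed to be a softmax regression trained by plain gradient descent, and all names (`predict_proba`, `train_softmax_classifier`, `confident_direction`, `steer`, `alpha`) are hypothetical. The sketch keeps only the activations whose classifier confidence for the target class is at least $\beta$, averages them into a direction $v$, and adds $v$ to a hidden state.

```python
import numpy as np

def predict_proba(X, W, b):
    """Softmax class probabilities for row-wise activations X."""
    logits = X @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_classifier(X, y, n_classes, lr=0.2, steps=500):
    """Assumed lightweight classifier: softmax regression via gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                  # one-hot labels
    for _ in range(steps):
        G = (predict_proba(X, W, b) - Y) / n  # gradient w.r.t. the logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def confident_direction(acts, W, b, target, beta):
    """Average the activations whose target-class confidence is >= beta."""
    conf = predict_proba(acts, W, b)[:, target]
    mask = conf >= beta
    if not mask.any():
        raise ValueError("no activation reaches the confidence threshold")
    return acts[mask].mean(axis=0), mask

def steer(hidden, v, alpha=1.0):
    """Add the steering direction to a hidden state (alpha is a hypothetical scale)."""
    return hidden + alpha * v

# Synthetic stand-in for user-history activations: 3 preference classes, dim 8.
rng = np.random.default_rng(0)
n_classes, dim, n = 3, 8, 600
means = np.zeros((n_classes, dim))
means[np.arange(n_classes), np.arange(n_classes)] = 3.0
y = rng.integers(0, n_classes, size=n)
acts = means[y] + rng.normal(size=(n, dim))

W, b = train_softmax_classifier(acts, y, n_classes)
target, beta = 1, 0.8
v, mask = confident_direction(acts, W, b, target, beta)

hidden = np.zeros((1, dim))                   # a neutral hidden state
p_before = predict_proba(hidden, W, b)[0, target]
p_after = predict_proba(steer(hidden, v), W, b)[0, target]
print(f"selected {mask.sum()} of {n} activations")
print(f"target probability before: {p_before:.3f}, after: {p_after:.3f}")
```

In this toy setting, raising `beta` keeps fewer but more clearly on-target activations, which is the trade-off the confidence threshold controls. In a real model the vectors would be hidden states captured at inference time, and the addition would be applied inside the forward pass.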

Paper Structure

This paper contains 24 sections, 25 equations, 20 figures, 1 table, 1 algorithm.

Figures (20)

  • Figure 1: Personalized LLM answers. Different users may input similar or even identical questions, yet they expect different responses from the LLM.
  • Figure 2: Framework of the model steering algorithm: The steering direction is inferred from the user's history and then fed back to the LLM to steer the output according to the user's preferences.
  • Figure 3: The framework of the confident direction selection method. A classifier $\mathcal{C}$ is trained to determine the confidence level of activations; those above the threshold $\beta$ are selected and averaged to derive $v$.
  • Figure 4: AG News: the x-axis is the confidence threshold $\beta$; the y-axis is the averaged probability that the content belongs to each class. Generally, a larger $\beta$ induces a higher success rate towards the target direction.
  • Figure 5: Emotion: the x- and y-axes are the same as for AG News steering. The Emotion dataset is noisier, so in some steering directions the success rate can drop as $\beta$ increases.
  • ...and 15 more figures

Theorems & Definitions (8)

  • Remark 1
  • Claim 1
  • Remark 2
  • Claim 2
  • Remark 3
  • Remark 4
  • proof
  • proof