
ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, Manling Li, Huajie Shao

TL;DR

This work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed ODESteer method.

Abstract

Activation steering, or representation engineering, offers a lightweight approach to aligning large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering, which fails to capture complex patterns in activation distributions. In this work, we propose a unified theoretical framework for activation steering in LLM alignment based on ordinary differential equations (ODEs). We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Under this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Building on this framework, we introduce ODESteer, an ODE-based steering method guided by barrier functions, which yields empirical gains in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and uses it to construct an ODE for multi-step, adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks: $5.7\%$ on TruthfulQA, $2.5\%$ on UltraFeedback, and $2.4\%$ on RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs and validating it empirically through the proposed ODESteer method.
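The relationship between one-step activation addition and multi-step ODE steering described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy barrier function (a log-density ratio of two isotropic Gaussians), the choice of vector field $\bm{v}(\bm{a}) = \nabla_{\bm{a}} h(\bm{a})$, the explicit Euler integrator, and all names (`make_barrier`, `steer`, `mu_pos`, `mu_neg`) are assumptions made for the example.

```python
import numpy as np


def make_barrier(mu_pos, mu_neg, sigma_pos=1.0, sigma_neg=2.0):
    """Toy barrier h(a) = log p_pos(a) - log p_neg(a) for two isotropic
    Gaussians (an assumption for illustration only). Returns h and grad h."""
    d = mu_pos.shape[0]

    def h(a):
        log_p_pos = -np.sum((a - mu_pos) ** 2) / (2 * sigma_pos**2) - d * np.log(sigma_pos)
        log_p_neg = -np.sum((a - mu_neg) ** 2) / (2 * sigma_neg**2) - d * np.log(sigma_neg)
        return log_p_pos - log_p_neg

    def grad_h(a):
        # Gradient of the log-density ratio; depends on a when the variances differ.
        return -(a - mu_pos) / sigma_pos**2 + (a - mu_neg) / sigma_neg**2

    return h, grad_h


def steer(a0, v, T, num_steps):
    """Explicit Euler integration of da/dt = v(a) from t = 0 to t = T.
    With num_steps = 1 this reduces to activation addition: a0 + T * v(a0)."""
    a = np.array(a0, dtype=float)
    dt = T / num_steps
    trajectory = [a.copy()]
    for _ in range(num_steps):
        a = a + dt * v(a)  # the direction is re-evaluated at the current activation
        trajectory.append(a.copy())
    return trajectory


# Hypothetical 2-D "activations": class means for positive and negative examples.
mu_pos = np.array([1.0, 0.0])
mu_neg = np.array([-1.0, 0.0])
h, grad_h = make_barrier(mu_pos, mu_neg)

a0 = np.array([-1.0, 0.5])
one_step = steer(a0, grad_h, T=1.0, num_steps=1)[-1]   # one-step steering
multi_step = steer(a0, grad_h, T=1.0, num_steps=10)    # multi-step steering
```

In this sketch the only difference between the two calls is `num_steps`: the one-step variant commits to the direction computed at the starting activation, while the multi-step variant recomputes the direction at every intermediate point.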

Paper Structure (35 sections, 2 theorems, 21 equations, 6 figures, 9 tables, 1 algorithm)


Key Result

Proposition 1

Suppose $h(\cdot)$ defined in Eq. eq:cbf-boundary satisfies $\dot{h}(\bm{a}) = \nabla_{\bm{a}} h(\bm{a})^\top \bm{v}(\bm{a}) > 0$ for all $\bm{a} \in \mathcal{A}$. Then the set $\mathcal{C} = \{ \bm{a} \in \mathbb{R}^d \mid h(\bm{a}) \ge 0 \}$ is asymptotically stable and forward invariant: any traj
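As a hedged illustration of the condition in Proposition 1 (not taken from the paper's proof), the expression for $\dot{h}$ follows from the chain rule along a trajectory of $\dot{\bm{a}} = \bm{v}(\bm{a})$. The specific choice $\bm{v}(\bm{a}) = \nabla_{\bm{a}} h(\bm{a})$ in the second line is an assumption made only for this example; it is one vector field that satisfies the condition wherever the gradient is nonzero.

```latex
\begin{align*}
\dot{h}(\bm{a}(t))
  &= \frac{d}{dt}\, h(\bm{a}(t))
   = \nabla_{\bm{a}} h(\bm{a})^\top \dot{\bm{a}}(t)
   = \nabla_{\bm{a}} h(\bm{a})^\top \bm{v}(\bm{a}), \\
\bm{v}(\bm{a}) = \nabla_{\bm{a}} h(\bm{a})
  &\;\Longrightarrow\;
   \dot{h}(\bm{a}) = \lVert \nabla_{\bm{a}} h(\bm{a}) \rVert_2^2 > 0
   \quad \text{wherever } \nabla_{\bm{a}} h(\bm{a}) \neq \bm{0}.
\end{align*}
```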

Figures (6)

  • Figure 1: Overview of existing activation steering methods vs. our proposed approach. (a–b) Regular activation addition applies a one-step linear steering $T\cdot \bm{v}(\bm{a})$ to hidden activations, where the vector field $\bm{v}(\bm{a})$ controls the steering direction and $T$ controls the steering strength, as detailed in Sec. \ref{sec:ode}. (c–d) Our method (ODESteer) formulates steering as numerically solving an ODE, yielding multi-step adaptive updates from $\bm{a}(0)$ to $\bm{a}(T)$ guided by barrier functions from control theory. (e) The barrier function $h(\bm{a})$ defines desirable and undesirable regions in the activation space, guiding the activations toward desirable regions and ensuring they remain there. (f) Example generations before and after steering show that ODESteer produces more accurate and aligned responses.
  • Figure 2: Visualization of the barrier function $h(\cdot)$ along ODE trajectories.
  • Figure 3: True$\times$Info scores across layers on TruthfulQA for three models using CAA [rimsky2024steering]. The best-performing layer is selected for steering: 15 for Falcon-7B, 16 for Mistral-7B, and 14 for Llama3.1-8B.
  • Figure 4: The impact of the number of numerical integration steps and the intervention strength $T$ on the True$\times$Info performance of ODESteer on TruthfulQA.
  • Figure 5: The impact of the number of numerical integration steps and the intervention strength $T$ on the True$\times$Info performance of ODESteer using Llama3.1-8B on TruthfulQA.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Proposition 1: [ames2016control, ames2019control]
  • Remark 1
  • Remark 2
  • Proposition 2
  • Proof of Proposition \ref{prop:barrier-increase}