Activation Steering with a Feedback Controller

Dung V. Nguyen, Hieu M. Vu, Nhi Y. Pham, Lei Zhang, Tan M. Nguyen

TL;DR

This work develops a control-theoretic foundation for activation steering by showing that popular steering methods correspond to proportional (P) controllers, with the steering vector serving as the feedback signal. It then proposes Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs.

Abstract

Controlling the behaviors of large language models (LLMs) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
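The roles of the three terms described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `diff_in_means` and `pid_steering_vectors`, the gain values, and the textbook discretization (running sum for I, backward difference for D) are all assumptions made for illustration, and the paper's own discretization may differ. The sketch also takes the per-layer errors as given, whereas in a true closed loop the error at each layer would be recomputed after earlier layers have been steered.

```python
import numpy as np


def diff_in_means(target_acts, source_acts):
    """Hypothetical error signal: mean target activation minus mean source activation.

    Both inputs have shape (num_examples, hidden_dim).
    """
    return np.mean(target_acts, axis=0) - np.mean(source_acts, axis=0)


def pid_steering_vectors(errors, kp=1.0, ki=0.1, kd=0.05):
    """Combine per-layer errors e(k) into per-layer steering vectors u(k).

    errors has shape (num_layers, hidden_dim). The layer index k plays the
    role of the discrete time step. Gains are illustrative defaults only.
    """
    errors = np.asarray(errors, dtype=float)
    integral = np.zeros(errors.shape[1])  # running sum of errors (I term)
    previous = errors[0].copy()           # makes the D term zero at the first layer
    steering = np.empty_like(errors)
    for k, e in enumerate(errors):
        integral = integral + e           # accumulate error across layers
        derivative = e - previous         # backward difference between layers
        steering[k] = kp * e + ki * integral + kd * derivative
        previous = e
    return steering
```

With `ki=0` and `kd=0` the function returns `kp * errors`, which is the P-only case that the abstract identifies with existing steering methods.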

Paper Structure

This paper contains 44 sections, 13 theorems, 144 equations, 8 figures, 3 tables.

Key Result

Proposition 1

P-control activation steering ensures input-to-state stability (ISS) for an appropriate range of $K_p$. However, there still exists a steady-state error due to the disturbance ${\bm{w}}(k)$ to the state of the system. In the best case, when ${\bm{w}}(k)$ converges to ${\bm{w}}$, under a mild conditi
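The steady-state error under P control can be seen in a toy scalar simulation. This is a sketch under assumed dynamics, not the system analyzed in the paper: the recursion $e(k+1) = e(k) - u(k) + w$, the function name `simulate`, and the gain values are all hypothetical choices made to keep the example small. The constant `w` stands in for a disturbance that has already converged.

```python
def simulate(kp, ki=0.0, w=0.2, steps=200, e0=1.0):
    """Toy scalar error recursion e(k+1) = e(k) - u(k) + w.

    u(k) = kp * e(k) + ki * sum of e(0..k). Setting ki=0 gives pure P control.
    Returns the list of errors e(0), ..., e(steps).
    """
    e = e0
    integral = 0.0
    history = [e]
    for _ in range(steps):
        integral += e                # accumulated error for the I term
        u = kp * e + ki * integral   # control signal
        e = e - u + w                # assumed toy dynamics with constant disturbance
        history.append(e)
    return history


p_only = simulate(kp=0.5)            # settles near w / kp, not zero
p_and_i = simulate(kp=0.5, ki=0.1)   # integral term drives the error toward zero
```

In this toy model the P-only fixed point satisfies $e^* = (1 - K_p)e^* + w$, so $e^* = w / K_p$, which is nonzero whenever $w \neq 0$. With the integral term, a fixed point requires the accumulated sum to stop changing, which forces $e^* = 0$ provided the chosen gains keep the loop stable.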

Figures (8)

  • Figure 1: Our paper connects LLM Behavior Control, Feature Attribution for LLMs, and Control Theory. Specifically, we apply a PID controller to compute the steering vector for activation steering.
  • Figure 2: PID Steering. To compute the steering vector $u(k)$, a PID controller is applied at every layer $f^{(k)}(\cdot)$, using the diff-in-means between two contrastive datasets $x_{sp}(k)$ and $x(k)$ as the error signal $e(k)$.
  • Figure 3: Scalar errors across time steps of a randomly initialized model after applying P, PI, and PID controllers.
  • Figure 4: Qualitative results of activation steering in FLUX-Schnell across two style concepts with the prompt "Lady bent over with red polka dot umbrella inside a brick building."
  • Figure 5: 0-shot and CLIPScore results for the 'cyberpunk' and 'steampunk' concepts.
  • ...and 3 more figures

Theorems & Definitions (20)

  • Proposition 1: Steady-state error of P-control activation steering
  • Lemma 1: Discretizing PID steering vector
  • Definition 1: PID Steering
  • Proposition 2: Error dynamics of activation steering
  • Proposition 3: Stabilizing the PI loop reduces steady-state error
  • Theorem 1: Stabilizing the PID loop preserves bias removal
  • Theorem 2: PID reduces the first-overshoot amplitude
  • Lemma 1: Discretizing PID steering vector
  • Proposition 3: Error dynamics of activation steering
  • Proposition 3: Steady-state error of P-control activation steering
  • ...and 10 more