Fine-Grained Activation Steering: Steering Less, Achieving More

Zijian Feng, Tianjiao Li, Zixiao Zhu, Hanzhang Zhou, Junlang Qian, Li Zhang, Jia Jim Deryl Chua, Lee Onn Mak, Gee Wah Ng, Kezhi Mao

TL;DR

This work reveals that block-level activation steering is inherently limited by heterogeneous AU-level contributions within layers of transformers. It introduces AU-level activations and AUSteer, which localizes discriminative AUs using activation momentum and applies adaptive, input-dependent steering, updating activations via $\hat{x}_i = x_i + \gamma_i x_i$. Across diverse tasks and large models, AUSteer consistently outperforms block-level baselines while intervening on far fewer activations, demonstrating that steering less can achieve more and offering a scalable, efficient approach for fine-grained activation control in LLMs.
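The update rule above, $\hat{x}_i = x_i + \gamma_i x_i = (1 + \gamma_i)\,x_i$, rescales each selected dimension in proportion to its own activation and leaves every other dimension untouched. A minimal sketch of that rule in plain Python follows. The function name `steer_aus`, the dictionary of strengths, and the numbers are illustrative assumptions and do not come from the paper, which also determines $\gamma_i$ adaptively rather than by hand.

```python
def steer_aus(x, gammas):
    """Apply x_hat_i = x_i + gamma_i * x_i to the selected dimensions only.

    x:      one block activation vector, as a list of floats.
    gammas: hypothetical mapping {AU index: steering strength gamma_i}.
            Dimensions absent from the mapping are left unchanged.
    """
    steered = list(x)  # copy so the input activation is not mutated
    for i, gamma in gammas.items():
        steered[i] = x[i] + gamma * x[i]
    return steered


# Toy activation with four dimensions; only AUs 1 and 3 are steered.
x = [0.5, -2.0, 1.0, 4.0]
gammas = {1: 0.5, 3: -0.25}  # hand-picked values, for illustration only
x_hat = steer_aus(x, gammas)
print(x_hat)  # [0.5, -3.0, 1.0, 3.0]
```

Because the shift is multiplicative, a positive $\gamma_i$ amplifies the AU's activation whatever its sign and a negative $\gamma_i$ damps it, so the same strength yields a larger change on inputs where the AU is already strongly active.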

Abstract

Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors. Existing methods typically intervene at the block level, steering the bundled activations of selected attention heads, feedforward networks, or residual streams. However, we reveal that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features, thereby rendering block-level steering coarse, inefficient, and intrusive. To investigate the root cause, we decompose block activations into fine-grained atomic unit (AU)-level activations, where each AU-level activation corresponds to a single dimension of the block activation, and each AU denotes a slice of the block weight matrix. Steering an AU-level activation is thus equivalent to steering its associated AU. Our theoretical and empirical analyses show that heterogeneity arises because different AUs or dimensions control distinct token distributions in LLM outputs. Hence, block-level steering inevitably moves helpful and harmful token directions together, which reduces efficiency. Restricting intervention to beneficial AUs yields more precise and effective steering. Building on this insight, we propose AUSteer, a simple and efficient method that operates at the finer granularity of the AU level. AUSteer first identifies discriminative AUs globally by computing activation momenta on contrastive samples. It then assigns adaptive steering strengths tailored to diverse inputs and selected AU activations. Comprehensive experiments on multiple LLMs and tasks show that AUSteer consistently surpasses advanced baselines while steering considerably fewer activations, demonstrating that steering less achieves more.
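The localization step described above ranks individual dimensions by how well they separate contrastive samples and keeps only the top few. The abstract does not define activation momentum, so the sketch below swaps in a simple stand-in score: the absolute gap between per-dimension means on positive and negative samples, divided by their combined spread. This stand-in, the function names, and the toy data are all assumptions made for illustration; it is not the paper's scoring rule.

```python
from statistics import mean, pstdev


def discriminative_scores(pos, neg, eps=1e-8):
    """Score each dimension by a stand-in separation measure.

    NOTE: this is NOT the paper's activation momentum, only a simple proxy:
    |mean(pos) - mean(neg)| / (std(pos) + std(neg) + eps) per dimension.
    pos, neg: lists of activation vectors from contrastive samples.
    """
    scores = []
    for i in range(len(pos[0])):
        p = [row[i] for row in pos]
        n = [row[i] for row in neg]
        gap = abs(mean(p) - mean(n))
        spread = pstdev(p) + pstdev(n) + eps
        scores.append(gap / spread)
    return scores


def top_k_aus(scores, k):
    """Return the indices of the k highest-scoring dimensions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy contrastive activations: only dimension 2 flips sign between classes.
pos = [[1.0, 0.2, 3.0], [1.2, 0.1, 3.1], [0.8, 0.3, 2.9]]
neg = [[1.1, 0.2, -3.0], [0.9, 0.3, -2.9], [1.0, 0.1, -3.1]]
selected = top_k_aus(discriminative_scores(pos, neg), k=1)
print(selected)  # [2]
```

Only the selected indices would then be steered, which is the sense in which the method intervenes on far fewer activations than block-level steering.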

Paper Structure (30 sections, 11 equations, 13 figures, 13 tables)

This paper contains 30 sections, 11 equations, 13 figures, 13 tables.

Figures (13)

  • Figure 1: Comparison of block-level steering (prior work) and AU-level steering (Ours).
  • Figure 2: Heterogeneous steering results for MHA and FFNs.
  • Figure 3: Pairwise KL divergence when steering different AUs. $s$ means strength.
  • Figure 4: Top-k decoded tokens controlled by different AUs. The answer to the input prompt is "yes".
  • Figure 5: Overview of AUSteer: (1) AU localization using activation momentum and discriminative scores; and (2) Adaptive steering across diverse inputs and AUs.
  • ...and 8 more figures