Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors

Weixuan Wang, Jingyuan Yang, Wei Peng

TL;DR

SADI introduces a training-free, semantics-aware approach to steer LLMs at inference by constructing a dynamic steering vector from input-specific activation differences. It identifies critical components via a contrastive-difference analysis, builds a top-$K$ binary mask, and applies an adaptive, input-aligned update with strength $\delta$ to the last-token activations across layers, heads, or FFNs. Across four backbones and eleven tasks, SADI substantially outperforms fixed steering and random interventions, with notable gains on attention-head interventions and robust generalization to multilingual and few-shot settings. The method requires only ~150 contrastive examples to build the mask and does not require training, offering a cost-effective, broadly applicable activation-intervention technique for LLM alignment with strong practical potential.

Abstract

Large language models (LLMs) have achieved remarkable performance across many tasks, yet aligning them with desired behaviors remains challenging. Activation intervention has emerged as an effective and economical method to modify the behavior of LLMs. Despite considerable interest in this area, current intervention methods exclusively employ a fixed steering vector to modify model activations and therefore cannot adapt to diverse input semantics. To address this limitation, we propose Semantics-Adaptive Dynamic Intervention (SADI), a novel method that constructs a dynamic steering vector to intervene on model activations at inference time. More specifically, SADI uses activation differences in contrastive pairs to identify the critical elements of an LLM (i.e., attention heads, hidden states, and neurons) for targeted intervention. During inference, SADI dynamically steers model behavior by scaling element-wise activations based on the directions of input semantics. Experimental results show that SADI outperforms established baselines by substantial margins, improving task performance without training. SADI's cost-effectiveness and generalizability across various LLM backbones and tasks highlight its potential as a versatile alignment technique.

Paper Structure

This paper contains 40 sections, 6 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Three steps of SADI: (1) Difference Extraction: extract the activation differences between positive and negative examples from all model layers; (2) Binary Masking: compute the mean activation difference to locate the key elements and produce an identification mask by binarization; and (3) Adaptive Steering: intervene on the activations during inference by applying the identification mask to the input activations, scaled by a factor of $\delta$.
  • Figure 2: Results with varying intervention strength and number of key attention heads on the COPA, StoryCloze, SST2, and TriviaQA tasks with LLaMA2-7b-chat.
  • Figure 3: Activation difference of each head across layers, and the distribution of the top-100 activation differences of neurons and hidden states, with LLaMA2-7b-chat on StoryCloze.
  • Figure 4: Relationship between accuracy and the number of contrastive pairs.
  • Figure 5: Overlap of identified key elements across tasks. Indices 1 to 10 denote the tasks COPA, StoryCloze, SST2, BoolQ, MMLU, NLI, Winogrande, TriviaQA, ToxiGen, and TruthfulQA.
  • ...and 2 more figures
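
The three steps in the Figure 1 caption can be illustrated with a minimal, dependency-free sketch on toy vectors. It is not the authors' implementation. The function names (`mean_difference`, `top_k_mask`, `adaptive_steer`), the choice to rank elements by the absolute value of the mean difference, and the additive update $a' = a + \delta\,(m \odot a)$ are assumptions made for illustration; the caption only states that the mask is applied to the input activations and scaled by $\delta$. In a real model, each vector would hold the last-token activations of the chosen component (hidden states, attention heads, or FFN neurons).

```python
from typing import List

Vector = List[float]


def mean_difference(positives: List[Vector], negatives: List[Vector]) -> Vector:
    """Step 1 (assumed form): average the positive-minus-negative activation differences."""
    if not positives or len(positives) != len(negatives):
        raise ValueError("need equally many positive and negative activations")
    n_pairs = len(positives)
    dim = len(positives[0])
    return [
        sum(p[i] - n[i] for p, n in zip(positives, negatives)) / n_pairs
        for i in range(dim)
    ]


def top_k_mask(mean_diff: Vector, k: int) -> List[int]:
    """Step 2 (assumed ranking): 1 at the k entries with the largest |mean difference|, else 0."""
    ranked = sorted(range(len(mean_diff)), key=lambda i: abs(mean_diff[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(mean_diff))]


def adaptive_steer(activation: Vector, mask: List[int], delta: float) -> Vector:
    """Step 3 (assumed update): a' = a + delta * (mask * a), element-wise."""
    return [a + delta * m * a for a, m in zip(activation, mask)]


# Toy activations for two contrastive pairs (hypothetical values).
positives = [[1.0, 0.5, -2.0, 0.0], [3.0, 0.5, -1.0, 0.25]]
negatives = [[0.0, 0.5, 1.0, 0.0], [1.0, 0.0, 1.0, 0.25]]

mean_diff = mean_difference(positives, negatives)  # offline, once per task
mask = top_k_mask(mean_diff, k=2)                  # offline, once per task
steered = adaptive_steer([2.0, -1.0, 0.5, 4.0], mask, delta=2.0)  # per input
```

Because the update multiplies each masked entry by the input's own activation, the steering direction follows the sign and magnitude of the current input rather than a fixed vector; only the mask is precomputed from the contrastive pairs.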