REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering

Li-Ming Zhan, Bo Liu, Chengqiang Xie, Jiannong Cao, Xiao-Ming Wu

TL;DR

REAL identifies behavior-relevant Transformer modules by training a per-head vector-quantized autoencoder to disentangle behavior-related latent factors, then uses a learned autoregressive prior over the resulting code sequences to score and select heads. The selected heads are steered with weights proportional to their discriminative scores, which improves inference-time control over truthfulness, knowledge selection, and general alignment across multiple models and datasets. On truthfulness steering, the approach achieves an average relative improvement of about 20% (up to 81.5%) over the ITI baseline, and the selected heads transfer zero-shot across domains. REAL thus addresses polysemantic activations with a data- and computation-efficient framework for module-level interventions in LLM steering.

Abstract

Inference-time steering aims to alter a large language model's (LLM's) responses without changing its parameters, but a central challenge is identifying the internal modules that most strongly govern the target behavior. Existing approaches often rely on simplistic cues or ad hoc heuristics, leading to suboptimal or unintended effects. We introduce REAL, a framework for identifying behavior-relevant modules (attention heads or layers) in Transformer models. For each module, REAL trains a vector-quantized autoencoder (VQ-AE) on its hidden activations and uses a shared, learnable codebook to partition the latent space into behavior-relevant and behavior-irrelevant subspaces. REAL quantifies a module's behavioral relevance by how well its VQ-AE encodings discriminate behavior-aligned from behavior-violating responses via a binary classification metric; this score guides both module selection and steering strength. We evaluate REAL across eight LLMs from the Llama and Qwen families and nine datasets spanning truthfulness enhancement, open-domain QA under knowledge conflicts, and general alignment tasks. REAL enables more effective inference-time interventions, achieving an average relative improvement of 20% (up to 81.5%) over the ITI method on truthfulness steering. In addition, the modules selected by REAL exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
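The scoring-and-selection idea in the abstract can be illustrated with a minimal sketch. It assumes each head already has a scoring function that outputs a probability of "behavior-aligned" for each response; the toy probabilities, the helper names `auc_roc` and `select_heads`, and the choice to normalise the selected scores so the weights sum to 1 are all illustrative assumptions, not details taken from the paper.

```python
from typing import Dict, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical identifier format


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_heads(head_scores: Dict[Head, float], k: int) -> Dict[Head, float]:
    """Keep the k highest-scoring heads; weights are proportional to their scores.

    Normalising the weights to sum to 1 is an assumption made for this sketch.
    """
    top = sorted(head_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(score for _, score in top)
    return {head: score / total for head, score in top}


# Toy data (made up): 1 = behavior-aligned response, 0 = behavior-violating.
labels = [1, 1, 0, 0]
head_probs = {
    (11, 22): [0.9, 0.8, 0.2, 0.1],  # separates the two classes perfectly
    (30, 27): [0.5, 0.4, 0.5, 0.4],  # no better than chance
    (5, 3): [0.7, 0.4, 0.5, 0.2],    # partially discriminative
}

head_scores = {head: auc_roc(p, labels) for head, p in head_probs.items()}
weights = select_heads(head_scores, k=2)

for head, w in weights.items():
    print(f"layer {head[0]}, head {head[1]}: AUC={head_scores[head]:.2f}, weight={w:.3f}")
```

In this sketch, a head whose score is at chance level (AUC of 0.5) is dropped, and the remaining heads receive steering weights in proportion to how well they discriminate the two response classes.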

Paper Structure

This paper contains 59 sections, 15 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Examples of Llama2-7B-Chat responses to TruthfulQA questions using different steering methods: the no-steering baseline (Original), ITI [DBLP:conf/nips/0002PVPW23], and our proposed REAL method. REAL demonstrates three targeted behaviors: faithful enrichment without contradiction, factual accuracy (rejection of myths/falsehoods), and calibrated refusal that avoids confabulation while adhering to the target policy.
  • Figure 2: The top 48 attention heads in Llama2-7B-Chat identified by ITI and REAL, based on TruthfulQA.
  • Figure 3: Overview of the proposed REAL framework. We use activations from each attention head to train a VQ-AE, aiming to learn a disentangled, quantized latent space. The VQ-AE is trained using a latent contrastive loss in conjunction with the standard VQ loss. The discrete encodings produced by the VQ-AE are then used to train a scoring function that outputs the probability of a given encoding corresponding to the target behavior. Finally, a binary classification metric, such as the area under the ROC curve (AUC-ROC), is employed to determine the behavioral-relevance score for each head.
  • Figure 4: t-SNE visualization comparing the highest-performing head (11th layer, 22nd head; top row) and the lowest-performing head (30th layer, 27th head; bottom row) of Llama2-7B on TruthfulQA.
  • Figure 5: Heatmaps of behavior-relevance scores for each attention head on TruthfulQA across four models.
  • ...and 1 more figure
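
The quantization step mentioned in the Figure 3 caption can be sketched as a nearest-neighbour lookup in a codebook, which is the standard vector-quantization operation. The codebook values, the latent vectors, and the helper name `nearest_code` below are made up for illustration; the paper's encoder, its contrastive loss, and its training procedure are not reproduced here.

```python
from typing import Sequence


def nearest_code(z: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to z (squared Euclidean distance)."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))


# Toy 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Toy encoder outputs for three activations (illustrative values only).
latents = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]

# Each continuous latent is replaced by the index of its nearest codebook entry.
codes = [nearest_code(z, codebook) for z in latents]
print(codes)
```

The resulting discrete indices are the kind of encoding that a downstream scoring function could consume when estimating whether a response matches the target behavior.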