REAL: Reading Out Transformer Activations for Precise Localization in Language Model Steering

Li-Ming Zhan; Bo Liu; Chengqiang Xie; Jiannong Cao; Xiao-Ming Wu

TL;DR

REAL identifies behavior-relevant Transformer modules (attention heads or layers) by training a vector-quantized autoencoder (VQ-AE) on each module's activations to disentangle behavior-related latent factors, then uses a learned autoregressive prior over the resulting code sequences to score and select modules. The selected modules are steered with weights proportional to their discriminative scores, improving inference-time control over truthfulness, knowledge selection, and general alignment across eight LLMs and nine datasets. On truthfulness steering, REAL achieves an average relative gain of 20% (up to 81.5%) over the ITI method, and the selected modules transfer zero-shot across domains. The framework targets polysemantic activations while remaining data- and computation-efficient.

Abstract

Inference-time steering aims to alter a large language model's (LLM's) responses without changing its parameters, but a central challenge is identifying the internal modules that most strongly govern the target behavior. Existing approaches often rely on simplistic cues or ad hoc heuristics, leading to suboptimal or unintended effects. We introduce REAL, a framework for identifying behavior-relevant modules (attention heads or layers) in Transformer models. For each module, REAL trains a vector-quantized autoencoder (VQ-AE) on its hidden activations and uses a shared, learnable codebook to partition the latent space into behavior-relevant and behavior-irrelevant subspaces. REAL quantifies a module's behavioral relevance by how well its VQ-AE encodings discriminate behavior-aligned from behavior-violating responses via a binary classification metric; this score guides both module selection and steering strength. We evaluate REAL across eight LLMs from the Llama and Qwen families and nine datasets spanning truthfulness enhancement, open-domain QA under knowledge conflicts, and general alignment tasks. REAL enables more effective inference-time interventions, achieving an average relative improvement of 20% (up to 81.5%) over the ITI method on truthfulness steering. In addition, the modules selected by REAL exhibit strong zero-shot generalization in cross-domain truthfulness-steering scenarios.
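The scoring and selection pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the majority-vote separability score, and the score-normalized weighting are all assumptions chosen for illustration, since the abstract does not specify the exact classification metric or steering formula.

```python
import numpy as np


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # Pairwise squared Euclidean distances, shape (n_samples, n_codes).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def code_separability(codes, labels, n_codes):
    """Hypothetical relevance score (an assumption, not the paper's metric):
    accuracy of predicting the binary label from the code index alone,
    using a majority vote within each code."""
    codes = np.asarray(codes)
    labels = np.asarray(labels)
    correct = 0
    for c in range(n_codes):
        mask = codes == c
        if mask.any():
            positives = int(labels[mask].sum())
            # The majority label of this code classifies its members.
            correct += max(positives, int(mask.sum()) - positives)
    return correct / len(labels)


def select_and_weight(scores, k):
    """Pick the k highest-scoring modules and give each a steering weight
    proportional to its score (normalized to sum to 1, by assumption)."""
    scores = np.asarray(scores, dtype=float)
    top = np.argsort(scores)[::-1][:k]
    weights = scores[top] / scores[top].sum()
    return top, weights
```

In this sketch, a module whose code assignments split behavior-aligned and behavior-violating responses cleanly receives a score near 1 and therefore a larger share of the steering strength, while a module whose codes mix both classes scores near 0.5 and is unlikely to be selected.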
