LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models
Jingyuan Yang, Rongjun Li, Weixuan Wang, Ziyu Zhou, Zhiyong Feng, Wei Peng
TL;DR
This work tackles semantic inconsistency in LLMs by shifting from component-level activation steering to a fine-grained, feature-level approach. LF-Steering maps a selected transformer layer's hidden states into a sparse, high-dimensional feature space via a TopK sparse autoencoder, identifies consistency-relevant features through a contrastive locating process, and performs inference-time steering by injecting a bias on the activated features, controlled by a strength parameter $\alpha$. The method achieves state-of-the-art semantic consistency across NLU and NLG tasks, with substantial improvements over baselines and prior activation-steering methods, while preserving locality on out-of-domain data. The results demonstrate that decoupled latent features offer precise control with minimal interference, and the authors outline future work to extend steering across multiple transformer layers to further enhance semantic consistency.
Abstract
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLMs' behaviours by adjusting their latent representations at inference time, has been explored to improve the semantic consistency of LLMs. However, these methods typically operate at the model component level, such as layer hidden states or attention head outputs. They are limited by the "polysemanticity issue": individual model components of LLMs typically encode multiple entangled features, which makes precise steering difficult. To address this challenge, we drill down to feature-level representations and propose LF-Steering, a novel activation steering approach that precisely identifies the latent feature representations responsible for semantic inconsistency. More specifically, our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space using a sparse autoencoder (SAE), so that steering acts on decoupled feature representations with minimal interference. Comprehensive experiments on NLU and NLG datasets demonstrate the effectiveness of our method in enhancing semantic consistency, resulting in significant performance gains across various NLU and NLG tasks.
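To make the three stages described above concrete (sparse encoding, contrastive locating, and feature-level steering), the following is a minimal, illustrative sketch in plain Python. It is not the authors' implementation. The function names (`topk_encode`, `locate_features`, `steer`), the use of a mean-activation gap as the contrastive score, and the choice to add the decoded feature shift back onto the original hidden state are all assumptions made for illustration; only the TopK sparse encoding, the restriction to activated features, and the strength parameter $\alpha$ come from the text.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def topk_encode(h: Vector, w_enc: Matrix, b_enc: Vector, k: int) -> Vector:
    """Map a hidden state to sparse features, keeping the k largest pre-activations.

    w_enc has shape (n_features, d_model). Kept values pass through a ReLU;
    every other feature is set to zero.
    """
    pre = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(w_enc, b_enc)]
    top = set(sorted(range(len(pre)), key=pre.__getitem__, reverse=True)[:k])
    return [max(p, 0.0) if i in top else 0.0 for i, p in enumerate(pre)]


def locate_features(
    z_consistent: List[Vector], z_inconsistent: List[Vector], n: int
) -> Dict[int, float]:
    """Assumed contrastive score: gap in mean activation between the two groups.

    Returns the n features with the largest absolute gap, mapped to that gap,
    which is reused below as the steering bias.
    """
    dim = len(z_consistent[0])

    def mean(zs: List[Vector], i: int) -> float:
        return sum(z[i] for z in zs) / len(zs)

    gap = [mean(z_consistent, i) - mean(z_inconsistent, i) for i in range(dim)]
    idx = sorted(range(dim), key=lambda i: abs(gap[i]), reverse=True)[:n]
    return {i: gap[i] for i in idx}


def steer(
    h: Vector, z: Vector, w_dec: Matrix, bias: Dict[int, float], alpha: float
) -> Vector:
    """Shift located features that are currently active by alpha * bias.

    w_dec has shape (d_model, n_features). Only the decoded feature shift is
    added to h, so the rest of the hidden state is left untouched (an
    assumption of this sketch).
    """
    delta = [
        alpha * bias.get(i, 0.0) if z_i > 0.0 else 0.0 for i, z_i in enumerate(z)
    ]
    return [h_j + sum(w * d for w, d in zip(row, delta)) for h_j, row in zip(h, w_dec)]


# Toy example: d_model = 2, n_features = 4 (hypothetical weights).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
w_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]

h = [2.0, 1.0]
z = topk_encode(h, w_enc, b_enc, k=2)
bias = locate_features(
    z_consistent=[[2.0, 1.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]],
    z_inconsistent=[[0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    n=1,
)
h_steered = steer(h, z, w_dec, bias, alpha=0.5)
```

In this toy setup, setting `alpha=0.0` returns the hidden state unchanged, and a located feature that is not active for a given input contributes no shift, which mirrors the idea of steering only on activated features.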
