LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models

Jingyuan Yang, Rongjun Li, Weixuan Wang, Ziyu Zhou, Zhiyong Feng, Wei Peng

TL;DR

This work tackles semantic inconsistency in LLMs by shifting from component-level activation steering to a fine-grained, feature-level approach. LF-Steering maps a selected transformer layer's hidden states into a sparse, high-dimensional feature space via a TopK sparse autoencoder, identifies consistency-relevant features through a contrastive locating process, and performs inference-time steering by injecting a bias on the activated features, controlled by a strength parameter $\alpha$. The method achieves state-of-the-art semantic consistency across NLU and NLG tasks, with substantial improvements over baselines and prior activation-steering methods, while preserving locality on out-of-domain data. The results demonstrate that decoupled latent features offer precise control with minimal interference, and the authors outline future work to extend steering across multiple transformer layers to further enhance semantic consistency.

Abstract

Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. Recently, activation steering, a technique that modulates LLMs' behaviours by adjusting their latent representations during inference time, has been explored to improve the semantic consistency of LLMs. However, these methods typically operate at the model component level, such as layer hidden states or attention head outputs. They face a challenge due to the "polysemanticity issue", where the model components of LLMs typically encode multiple entangled features, making precise steering difficult. To address this challenge, we drill down to feature-level representations and propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency. More specifically, our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder (SAE), ensuring model steering based on decoupled feature representations with minimal interference. Comprehensive experiments on NLU and NLG datasets demonstrate the effectiveness of our method in enhancing semantic consistency, resulting in significant performance gains for various NLU and NLG tasks.
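
To make the pipeline described above more concrete, the following is a minimal, dependency-free sketch of feature-level steering with a TopK sparse autoencoder. It is an illustration under stated assumptions, not the authors' implementation: the function names (`topk_encode`, `locate_features`, `steer`), the mean-difference locating criterion, the `threshold` parameter, and the choice to add `alpha` times the decoder direction of each selected, currently active feature are all hypothetical stand-ins for the paper's actual equations.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def topk_encode(h, W_enc, b_enc, k):
    """TopK SAE encoder: keep the k largest pre-activations, zero the rest."""
    pre = [p + b for p, b in zip(matvec(W_enc, h), b_enc)]
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    z = [0.0] * len(pre)
    for i in keep:
        z[i] = max(pre[i], 0.0)  # clamp negatives so codes stay non-negative
    return z


def locate_features(pos_codes, neg_codes, threshold):
    """Assumed contrastive criterion: pick features whose mean activation on
    'consistent' examples exceeds that on 'inconsistent' ones by > threshold."""
    def mean(codes, j):
        return sum(c[j] for c in codes) / len(codes)

    n_features = len(pos_codes[0])
    return [j for j in range(n_features)
            if mean(pos_codes, j) - mean(neg_codes, j) > threshold]


def steer(h, W_enc, b_enc, W_dec, k, features, alpha):
    """Shift the hidden state along the decoder directions of the selected
    features that are active for this input (assumed form of the bias)."""
    z = topk_encode(h, W_enc, b_enc, k)
    steered = list(h)
    for j in features:
        if z[j] > 0.0:  # only steer features the SAE activated
            for d in range(len(h)):
                steered[d] += alpha * W_dec[d][j]  # column j of the decoder
    return steered


# Toy weights: 2-dimensional hidden state, 4 latent features.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]

h = [2.0, -1.0]
z = topk_encode(h, W_enc, b_enc, k=2)
key = locate_features([[2.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 2.0]],
                      [[2.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]],
                      threshold=0.5)
h_steered = steer(h, W_enc, b_enc, W_dec, k=2, features=key, alpha=0.5)
```

In this sketch, steering touches only the directions of the located features, and only when they fire for the current input. This is one way to read the claim that decoupled features allow steering with minimal interference. In a real model the modified hidden state would be written back into the selected transformer layer during the forward pass.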

Paper Structure (18 sections, 3 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Semantic inconsistency in LLMs and how a "grey-box" approach based on activation steering addresses this problem.
  • Figure 2: The flowchart of our method. (1) Consistency Relevant Features Locating consists of two steps. We first use the method proposed by Yang et al. (2024) to locate the top-1 transformer layer, and then pretrain an SAE and use it to locate the key features responsible for the semantic inconsistencies in the LLM. (2) Steering the LLM towards greater semantic consistency by adjusting the values of the identified key features.
  • Figure 3: Comparison of layer-wise locating accuracy across 32 transformer layers for the experiment datasets.
  • Figure 4: Performance of our proposed activation steering method across different threshold values.
  • Figure 5: Performance of our proposed activation steering method across different feature activation values.