Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering

Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Xuanli He, Kam-Fai Wong, Pasquale Minervini

TL;DR

This work tackles context-memory knowledge conflicts in LLMs, where information in the context clashes with the model's learned parametric knowledge. It introduces SpARE, a training-free SAE-based framework that identifies a small set of monosemantic features in mid-layer activations and edits hidden states at inference time to steer whether the model relies on contextual or parametric knowledge. On open-domain QA, SpARE outperforms representation-engineering baselines by about +10% and contrastive decoding methods by about +15%, and it works across multiple models and datasets. The approach gives efficient, inference-time control over knowledge selection and a practical way to mitigate knowledge conflicts without retraining, while also offering insights into residual-stream dynamics and layer-specific steerability.

Abstract

Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context. This phenomenon, known as context-memory knowledge conflicts, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use inference-time intervention strategies to resolve it. In this work, we propose SpARE, a training-free representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. SpARE identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that SpARE can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods (+10%) as well as contrastive decoding methods (+15%).

Paper Structure

This paper contains 53 sections, 11 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: In the event of a knowledge conflict, the model can rely on the context or on the parametric knowledge. The figure presents the predictions of Llama2-7B steered by SpARE.
  • Figure 2: The knowledge conflict probing results of Llama2-7B and Gemma2-9B on NQSwap. The probing results on hidden states, MLP and Self-Attention activations are coloured differently.
  • Figure 3: The SpARE workflow for steering knowledge selection behaviour. The figure presents an example of steering the model to use parametric knowledge. First, the SAE encoder $f_\theta$ encodes the hidden state $\mathbf{h}$ into the SAE activation $\mathbf{z}$. Then, SpARE determines the values of the SAE activations $\mathbf{z}^{-}$ and $\mathbf{z}^{+}$ used for editing (the remove-activation and add-activation equations). Finally, we edit the hidden state using the features extracted from the SAE decoder $g_\phi$ (the hidden-state editing equation).
  • Figure 4: Detailed evaluation results of controlling capability on NQSwap. Colours distinguish methods and shapes distinguish models. In all panels, the upper-right area indicates high performance. (a) presents the capability of changing the behaviour of LLMs, where the $x$-axis and $y$-axis are $\text{EM}_{C \rightarrow M}$ and $\text{EM}_{M \rightarrow C}$, measuring the capability of changing the answer from $C$ to $M$ and from $M$ to $C$, respectively; (b) presents the capability of maintaining the behaviour when steering towards the model's original behaviour, where the $x$-axis and $y$-axis are $\text{EM}_{M \rightarrow M}$ and $\text{EM}_{C \rightarrow C}$, measuring the capability of continuing to generate $M$ and $C$, respectively; (c) presents the ablation analysis of SpARE, where the $x$-axis and $y$-axis are $\text{EM}_{M}$ and $\text{EM}_{C}$.
  • Figure 5: Effectiveness of SpARE on editing different layers individually.
  • ...and 14 more figures
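
The three steps in the Figure 3 caption (encode $\mathbf{h}$ into $\mathbf{z}$, choose $\mathbf{z}^{-}$ and $\mathbf{z}^{+}$, edit the hidden state through the decoder) can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the ReLU encoder, the function names `sae_encode` and `steer_hidden_state`, the arguments `remove_idx`, `add_idx` and `add_target`, and the specific rules used to pick $\mathbf{z}^{-}$ and $\mathbf{z}^{+}$ are all hypothetical, since the paper's equations are not reproduced on this page.

```python
import numpy as np


def sae_encode(h, W_enc, b_enc):
    """Assumed ReLU SAE encoder f_theta: z = ReLU(W_enc @ h + b_enc)."""
    return np.maximum(W_enc @ h + b_enc, 0.0)


def steer_hidden_state(h, W_enc, b_enc, W_dec, remove_idx, add_idx, add_target):
    """Edit h by removing some SAE features and adding others.

    Assumed shapes: W_enc is (n_features, d_model), W_dec is (d_model, n_features).
    remove_idx / add_idx are hypothetical index lists of features tied to the
    unwanted / wanted behaviour; add_target holds the desired activation for
    each entry of add_idx.
    """
    z = sae_encode(h, W_enc, b_enc)

    # z_minus: the part of the current activation we take away
    # (assumption: remove the selected features entirely).
    z_minus = np.zeros_like(z)
    z_minus[remove_idx] = z[remove_idx]

    # z_plus: how much is missing for the wanted features to reach their target
    # (assumption: never push an already-strong feature down).
    z_plus = np.zeros_like(z)
    z_plus[add_idx] = np.maximum(add_target - z[add_idx], 0.0)

    # Map both activation deltas back to hidden-state space with the decoder
    # feature directions (decoder bias omitted, since only differences matter).
    return h - W_dec @ z_minus + W_dec @ z_plus


# Toy usage with an identity "SAE" so the effect is easy to read off.
d = 4
h = np.array([1.0, 0.0, 2.0, 0.0])
h_edited = steer_hidden_state(
    h, np.eye(d), np.zeros(d), np.eye(d),
    remove_idx=[0], add_idx=[1], add_target=np.array([3.0]),
)
print(h_edited)
```

In the toy call, feature 0 is removed and feature 1 is raised to its target while the remaining coordinates are left untouched; with a real pre-trained SAE the same arithmetic would act along the learned decoder directions instead of the coordinate axes.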