Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

TL;DR

The paper identifies a frequency bias in sparse autoencoder explanations of LLMs, where linguistic patterns overshadow discourse topics. It introduces a mutual information-based objective over a fixed vocabulary to extract discourse-level, semantically meaningful explanations, and proposes two runtime steering strategies (Amplification and Calibration) that modulate LLM behavior using these explanations. Empirical results show that discourse-level explanations outperform baselines and can defend against jailbreak attacks with minimal impact on general helpfulness. The work highlights the practical value of explanation-driven steering for safer and more controllable LLM deployments.

Abstract

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for explaining their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explored how to better explain SAE features, i.e., how to understand the semantic meaning of the features learned by SAEs. Our theoretical analysis reveals that existing explanation methods suffer from a frequency bias: they emphasize linguistic patterns over semantic concepts, even though the latter are more critical for steering LLM behaviors. To address this, we propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective, aiming to better capture the semantic meaning behind these features. We further propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations. Empirical results show that, compared to baselines, our method provides more discourse-level explanations and effectively steers LLM behaviors to defend against jailbreak attacks. These findings highlight the value of explanations for steering LLM behaviors in downstream applications. We will release our code and data once accepted.
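To make the idea of scoring a fixed vocabulary by mutual information concrete, the following toy sketch ranks candidate words by the mutual information between "word appears in the text" and "feature is active". It is not the paper's objective, whose exact form is not given in this summary; the function names (`mutual_information`, `rank_vocabulary`), the binarization of activations, and the example texts are all assumptions made for illustration. It shows why a very frequent word such as "the" can score zero even though it co-occurs with every activation.

```python
import math
from collections import Counter


def mutual_information(word_present, feature_active):
    """Mutual information (in nats) between two equal-length binary sequences."""
    n = len(word_present)
    joint = Counter(zip(word_present, feature_active))
    p_word = Counter(word_present)
    p_feat = Counter(feature_active)
    mi = 0.0
    for (w, f), count in joint.items():
        p_wf = count / n
        mi += p_wf * math.log(p_wf / ((p_word[w] / n) * (p_feat[f] / n)))
    return mi


def rank_vocabulary(texts, activations, vocabulary):
    """Rank a fixed vocabulary by MI with a (hypothetical) feature's activity."""
    # Assumption: a feature counts as "active" on a text if its activation is > 0.
    active = [int(a > 0) for a in activations]
    scores = {}
    for word in vocabulary:
        present = [int(word in text.lower().split()) for text in texts]
        scores[word] = mutual_information(present, active)
    ranked = sorted(vocabulary, key=lambda w: scores[w], reverse=True)
    return ranked, scores


# Made-up texts and activations, purely for illustration.
texts = [
    "the bomb is in the lab",
    "the cat sat on the mat",
    "how to build the bomb",
    "the weather is nice",
]
activations = [1.0, 0.0, 0.8, 0.0]
vocabulary = ["the", "bomb", "cat"]

ranked, scores = rank_vocabulary(texts, activations, vocabulary)
print(ranked)
```

In this toy setting, "the" occurs in every text and therefore carries no information about the feature, while "bomb" tracks the activations exactly. A purely frequency-driven explanation would favor the former, whereas an information-based score favors the latter.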

Paper Structure

This paper contains 37 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Examples of explanations generated by our method and by baseline methods. We separate raw extracted words/spans with ";" and boldface their automated summaries. Our method tends to use diverse words to describe a semantic concept. In contrast, the spans extracted by baseline methods typically share duplicated phrases, indicating a frequency bias toward those linguistic patterns.
  • Figure 2: The proposed framework of explaining SAE features and steering LLMs with explanations.
  • Figure 3: Applying Aware Security for jailbreak defense based on explanations from different methods.
  • Figure 4: A case study on steering LLMs to defend against a jailbreak attack using Aware Security (AS). By enhancing security content in the LLM representations (i.e., larger $\beta$), the responses provide safer suggestions (starting from "blow up anything", switching to "blow up food", and ending with "cannot blow up").
  • Figure 5: Applying Aware Security for jailbreak defense based on our explanations in different layers.