Step-Level Sparse Autoencoder for Reasoning Process Interpretation

Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

TL;DR

The step-level sparse autoencoder (SSAE) is proposed as an analytical tool that disentangles different aspects of LLMs' reasoning steps into sparse features. It forms an information bottleneck in step reconstruction that separates incremental information from background information and disentangles the incremental information into several sparsely activated dimensions.

Abstract

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations indicate that LLMs should already at least partly know about these properties during generation, which provides the foundation for the self-verification ability of LLMs. The code is available at https://github.com/Miaow-Lab/SSAE
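The linear-probing idea mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the SSAE repository: the features are synthetic sparse vectors standing in for step features, the target is a made-up "step length", and the probe is a closed-form ridge regression written in NumPy. The names `make_sparse_features`, `fit_linear_probe`, and `predict` are hypothetical.

```python
import numpy as np


def make_sparse_features(n_steps, n_dims, n_active, rng):
    """Synthetic stand-in for step features: n_active nonzero dims per row."""
    feats = np.zeros((n_steps, n_dims))
    for row in feats:  # each row is a view, so writes land in feats
        idx = rng.choice(n_dims, size=n_active, replace=False)
        row[idx] = rng.uniform(0.5, 1.5, size=n_active)
    return feats


def fit_linear_probe(x, y, ridge=1e-3):
    """Closed-form ridge regression with an appended bias column."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ y)


def predict(weights, x):
    """Apply the linear probe to a batch of feature vectors."""
    xb = np.hstack([x, np.ones((x.shape[0], 1))])
    return xb @ weights


def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


rng = np.random.default_rng(0)
features = make_sparse_features(n_steps=2000, n_dims=64, n_active=8, rng=rng)

# Assumed toy target: a noisy linear function of the features ("step length").
true_w = rng.normal(size=64)
step_length = features @ true_w + 20.0 + rng.normal(scale=0.1, size=2000)

train_x, test_x = features[:1500], features[1500:]
train_y, test_y = step_length[:1500], step_length[1500:]

probe = fit_linear_probe(train_x, train_y)
probe_rmse = rmse(predict(probe, test_x), test_y)
print(f"held-out probe RMSE: {probe_rmse:.3f}")
```

Because the toy target is linear in the features by construction, the probe fits it well. With real step features, the held-out error of such a probe is what indicates how much of a property is linearly readable.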

Paper Structure

This paper contains 13 sections, 10 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: We evaluate two step-level tasks: first-token prediction and sentence-length prediction, representing the directional and depth-wise characteristics of a reasoning step. The y-axis reports relative metrics (PPL / RMSE) compared to a statistical baseline (lower is better). Token-based SAEs fail to capture such step-level information, while our SSAE achieves accurate prediction.
  • Figure 2: Overview of our SSAE framework.
  • Figure 3: Taxonomy and statistical distribution of N2G patterns across various SSAE configurations and datasets.
  • Figure 4: Case studies on SSAE feature perturbations. Modulating shared dimensions induces surface-level linguistic variations, whereas the exchange of unique dimensions triggers a crossover of underlying reasoning strategies.
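The relative metrics in the Figure 1 caption (PPL / RMSE compared to a statistical baseline, lower is better) can be sketched as a ratio of a model's metric to a baseline's metric. The caption does not define the baseline, so this sketch assumes the empirical first-token frequency for perplexity and the mean length for RMSE. All data and predictions below are hypothetical toy values, not results from the paper.

```python
import numpy as np


def perplexity(probs, targets):
    """exp of the mean negative log-likelihood assigned to the target tokens."""
    picked = probs[np.arange(len(targets)), targets]
    return float(np.exp(-np.mean(np.log(picked))))


def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


# Toy first-token ids over a 4-token vocabulary (hypothetical data).
first_tokens = np.array([0, 1, 0, 2, 0, 1, 3, 0])
vocab_size = 4

# Assumed statistical baseline: empirical token frequencies for every step.
counts = np.bincount(first_tokens, minlength=vocab_size)
baseline_probs = np.tile(counts / counts.sum(), (len(first_tokens), 1))

# Hypothetical probe output: 0.7 on the true token, 0.1 on each other token.
model_probs = np.full((len(first_tokens), vocab_size), 0.1)
model_probs[np.arange(len(first_tokens)), first_tokens] = 0.7

relative_ppl = perplexity(model_probs, first_tokens) / perplexity(
    baseline_probs, first_tokens
)

# Toy step lengths, with the mean length as the assumed baseline predictor.
lengths = np.array([12.0, 7.0, 15.0, 9.0, 11.0, 6.0, 14.0, 10.0])
baseline_pred = np.full_like(lengths, lengths.mean())
model_pred = lengths + np.array([1.0, -1.0, 0.5, -0.5, 1.0, -1.0, 0.5, -0.5])

relative_rmse = rmse(model_pred, lengths) / rmse(baseline_pred, lengths)

print(f"relative PPL:  {relative_ppl:.3f}")
print(f"relative RMSE: {relative_rmse:.3f}")
```

Under this convention, a value below 1 means the predictor beats the baseline. The toy baselines here are computed on the same data they are scored on. A real evaluation would estimate them on a training split.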