Towards Interpretable and Inference-Optimal COT Reasoning with Sparse Autoencoder-Guided Generation

Daniel Zhao, Abhilash Shankarampeta, Lanxiang Hu, Tajana Rosing, Hao Zhang

TL;DR

The paper tackles the challenge of mechanistically supervising inference-time reasoning in large language models by proposing a sparse autoencoder (SAE)–based representation pipeline that clusters token activations and builds a graph of latent transitions. It defines an edge-weighted reward R(p) to quantify adherence to established reasoning traces and to balance exploitation of high-probability paths with exploration of novel trajectories, enabling scalable guidance for intermediate token generation. Empirical results across math-centric tasks show that exploiting the graph-based reward alone is insufficient; a balanced exploitation–exploration strategy yields superior accuracy and structural alignment, with multiple metrics (DTW, KL divergence, entropy) used to validate generation quality and diversity. The approach offers a scalable, interpretable mechanism for inference-time supervision of reasoning in LLMs, with potential benefits for more efficient RL training and more robust long-form CoT generation in practical applications.

Abstract

We propose a novel method that leverages sparse autoencoders (SAEs) and clustering techniques to analyze the internal token representations of large language models (LLMs) and guide generations in mathematical reasoning tasks. Our approach first trains an SAE to generate sparse vector representations for training tokens, then applies k-means clustering to construct a graph where vertices represent token clusters and weighted edges capture sequential token transitions. Using this graph, we define an edge-weight-based reward function to quantify adherence to established reasoning traces, thereby identifying exploitative reasoning trajectories. Additionally, we measure generation diversity from clustering to assess the extent of exploration. Our findings indicate that balancing exploitation and exploration is crucial for achieving high accuracy in mathematical reasoning tasks. During generation, the SAE can serve as a scalable reward model to guide generations, ensuring a balanced trade-off between exploitation and exploration. This prevents extreme behaviors in either direction, ultimately fostering a higher-quality reasoning process in LLMs.
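To make the graph and reward idea concrete, here is a minimal sketch in Python. It assumes each token has already been mapped to a cluster id (for example by k-means over SAE activations, which is not shown). The function names, the per-source normalization of edge weights, the use of the mean edge weight as the path reward, and the entropy-based diversity score are all illustrative assumptions, not the paper's exact definitions.

```python
from collections import Counter, defaultdict
import math


def build_transition_graph(cluster_sequences):
    """Build a weighted directed graph from sequences of cluster ids.

    Edge weight = fraction of transitions leaving `src` that go to `dst`
    (an assumed normalization; the paper may weight edges differently).
    """
    counts = defaultdict(Counter)
    for seq in cluster_sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    graph = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        graph[src] = {dst: c / total for dst, c in dsts.items()}
    return graph


def path_reward(graph, path):
    """Assumed reward: mean edge weight along the path.

    Transitions never seen when building the graph contribute 0.
    """
    if len(path) < 2:
        return 0.0
    weights = [graph.get(s, {}).get(d, 0.0) for s, d in zip(path, path[1:])]
    return sum(weights) / len(weights)


def cluster_entropy(path):
    """Shannon entropy (in nats) of the clusters visited, as a rough diversity proxy."""
    counts = Counter(path)
    n = len(path)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Toy cluster-id sequences standing in for clustered training tokens.
training = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 2, 3]]
graph = build_transition_graph(training)

exploit_path = [0, 1, 2, 3]   # follows the most frequent transitions
explore_path = [0, 2, 3]      # takes a rarer edge out of cluster 0
print(path_reward(graph, exploit_path), cluster_entropy(exploit_path))
print(path_reward(graph, explore_path), cluster_entropy(explore_path))
```

In this toy setup the path that follows frequent transitions scores a higher reward, while the entropy term gives a separate handle on how varied a trajectory is. A guidance scheme in the spirit of the abstract would combine the two rather than maximize the reward alone.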

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Flow of our pipeline
  • Figure 2: Sample construction of graph with the sentence "3 + 4 + 7 apples". Left shows cluster assignment of tokens. Right shows conversion to edges and nodes.
  • Figure 3: Graph showing example exploit vs. explore reasoning trajectories. Exploit takes the greediest approach, only transitioning through high weight edges. Explore takes lower weight edges but arrives at the same correct answer.
  • Figure 4: Distribution of SAE cosine similarities between consecutive sequence elements across different models. Each subplot compares distributions between correctly generated sequences (green), incorrectly generated sequences (orange), and original sequences (blue). (a) MiniCPM-1B-sft shows relatively lower peak density but better alignment between correct and original distribution, (b) MiniCPM-2B-sft demonstrates higher peak density with better distributional alignment, and (c) MiniCPM-2B-128k exhibits the highest peak density but shows notable deviation from the original distribution pattern. The curves represent kernel density estimations, revealing the underlying probability distribution of the cosine similarities. Higher similarity values indicate stronger semantic relationships between consecutive elements in the sequences.
  • Figure 5: Distribution of centroid cosine similarities between consecutive sequence elements across different models. Each subplot compares distributions between correctly generated sequences (green), incorrectly generated sequences (orange), and original sequences (blue). (a) MiniCPM-1B-sft shows relatively lower peak density but better alignment between correct and original distribution, (b) MiniCPM-2B-sft demonstrates higher peak density with better distributional alignment, and (c) MiniCPM-2B-128k exhibits the highest peak density but shows notable deviation from the original distribution pattern. This centroid-based analysis complements the fine-grained SAE similarity distributions shown in Figure 4.
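
Figures 4 and 5 both rest on cosine similarities between consecutive sequence elements. The sketch below shows one way such per-sequence similarities could be computed in Python. The vectors are toy stand-ins for SAE activations or cluster centroids, the function names are hypothetical, and the kernel density estimation used for the plots is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def consecutive_similarities(vectors):
    """Similarity between each element and the next one in the sequence."""
    return [cosine_similarity(a, b) for a, b in zip(vectors, vectors[1:])]


# Toy sequence: the first two vectors coincide, the third is orthogonal.
toy_sequence = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sims = consecutive_similarities(toy_sequence)
print(sims)
```

Collecting these values over many correct, incorrect, and original sequences would give the samples whose distributions the figures compare.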