Towards Interpretable and Inference-Optimal COT Reasoning with Sparse Autoencoder-Guided Generation

Daniel Zhao; Abhilash Shankarampeta; Lanxiang Hu; Tajana Rosing; Hao Zhang

TL;DR

The paper tackles the challenge of mechanistically supervising inference-time reasoning in large language models by proposing a sparse autoencoder (SAE)–based representation pipeline that clusters token activations and builds a graph of latent transitions. It defines an edge-weighted reward R(p) to quantify adherence to established reasoning traces and to balance exploitation of high-probability paths with exploration of novel trajectories, enabling scalable guidance for intermediate token generation. Empirical results across math-centric tasks show that exploiting the graph-based reward alone is insufficient; a balanced exploitation–exploration strategy yields superior accuracy and structural alignment, with multiple metrics (DTW, KL divergence, entropy) used to validate generation quality and diversity. The approach offers a scalable, interpretable mechanism for inference-time supervision of reasoning in LLMs, with potential benefits for more efficient RL training and more robust long-form CoT generation in practical applications.

Abstract

We propose a novel method that uses sparse autoencoders (SAEs) and clustering to analyze the internal token representations of large language models (LLMs) and to guide generation in mathematical reasoning tasks. Our approach first trains an SAE to produce sparse vector representations of training tokens, then applies k-means clustering to these representations and constructs a graph in which vertices represent token clusters and weighted edges capture sequential token transitions. Using this graph, we define an edge-weight-based reward function that quantifies adherence to established reasoning traces, thereby identifying exploitative reasoning trajectories. We also measure generation diversity from the clustering to assess the extent of exploration. Our findings indicate that balancing exploitation and exploration is crucial for achieving high accuracy in mathematical reasoning tasks. During generation, the SAE can serve as a scalable reward model that steers the model toward this balance. This prevents extreme behavior in either direction and ultimately fosters a higher-quality reasoning process in LLMs.
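The pipeline described above can be illustrated with a minimal sketch. It assumes the SAE encoding and k-means steps have already been run, so each reasoning trace arrives as a sequence of integer cluster IDs. The function names, the use of normalized transition frequencies as edge weights, the mean log edge weight as the reward, and Shannon entropy as the diversity measure are all illustrative assumptions, not the paper's exact definitions.

```python
import math
from collections import Counter


def build_transition_graph(traces):
    """Build a weighted graph from traces of cluster IDs.

    Edge weight (assumption): empirical probability of moving from
    cluster `src` to cluster `dst` in the reference traces.
    """
    counts = {}
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts.setdefault(src, Counter())[dst] += 1
    graph = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        graph[src] = {dst: n / total for dst, n in dsts.items()}
    return graph


def path_reward(graph, path, eps=1e-6):
    """Hypothetical exploitation score: mean log edge weight along `path`.

    Transitions never seen in the reference traces get weight `eps`,
    so paths that leave the graph are penalized heavily.
    """
    steps = list(zip(path, path[1:]))
    if not steps:
        return 0.0
    total = sum(math.log(graph.get(s, {}).get(d, eps)) for s, d in steps)
    return total / len(steps)


def cluster_entropy(path):
    """Hypothetical exploration score: Shannon entropy (in nats)
    of the distribution of clusters visited along `path`."""
    n = len(path)
    counts = Counter(path)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Toy reference traces (made-up cluster IDs, for illustration only).
reference = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 3]]
graph = build_transition_graph(reference)

on_graph = path_reward(graph, [0, 1, 2, 3])   # follows common transitions
off_graph = path_reward(graph, [0, 3, 1])     # uses unseen transitions
```

In this toy setup, a path that follows frequent transitions scores higher on the reward, while a path that visits many distinct clusters scores higher on entropy. A guidance scheme in the spirit of the abstract would trade the two scores off rather than maximize either one alone.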
