Sparsification and Reconstruction from the Perspective of Representation Geometry
Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu
TL;DR
This work investigates how Sparse Autoencoders shape activation representations in language models from a representation-geometry perspective. It introduces SAEMA, a three-step framework that analyzes latent tensors under noise via rank-variability of symmetric semipositive definite (SSPD) matrices, revealing stratified manifold structure across concepts. By defining local/global representations and associated geometry metrics (Avg. ID, Betti 0, MSTW, AGD, Procrustes disparity), the study shows sparse encoding increases local dimensionality and substructure while compressing global structure, supporting feature disentanglement. An optimization-based intervention demonstrates that increasing separability among local representations causally improves SAE reconstruction (lower MSE), underscoring the practical value of geometric constraints for SAE design and interpretability. The findings provide a principled link between representational geometry and reconstruction fidelity, with implications for developing new interpretable tools and SAE-based editing of language-model activations.
Abstract
Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vectors from language models? What is the relationship between this organizational paradigm and feature disentanglement as well as reconstruction performance? To address these questions, we propose SAEMA, which validates the stratified structure of the representation by observing how the rank of the symmetric semipositive definite (SSPD) matrix, obtained from the mode-wise unfolding of the latent tensor, varies with the level of noise added to the residual stream. To systematically investigate how sparse encoding alters representational structures, we define local and global representations and demonstrate that sparse encoding amplifies inter-feature distinctions by merging similar semantic features and introducing additional dimensionality. Furthermore, we intervene on the global representation from an optimization perspective, demonstrating a significant causal relationship between representational separability and reconstruction performance. This study explains the principles of sparsity from the perspective of representational geometry and demonstrates the impact of changes in representational structure on reconstruction performance. In particular, it emphasizes the necessity of understanding representations and incorporating representational constraints, providing an empirical reference for developing new interpretable tools and improving SAEs. The code is available at https://github.com/wenjie1835/SAERepGeo.
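As a rough illustration of the rank-variability idea described above, the following minimal sketch unfolds a tensor along one mode, forms the Gram matrix of the unfolding (symmetric and positive semidefinite by construction), and records its numerical rank at several noise levels. It is not the SAEMA implementation: the toy low-rank tensor, the function names (`unfold`, `gram_rank`, `rank_profile`), the relative eigenvalue tolerance, and the choice to add Gaussian noise directly to the tensor (rather than to a residual stream that is then passed through an SAE encoder) are all assumptions made for illustration.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode unfolding: rows index the chosen mode, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def gram_rank(tensor, mode, rel_tol=1e-8):
    """Numerical rank of the Gram matrix U @ U.T of the mode unfolding U.

    rel_tol is a hypothetical threshold: eigenvalues below rel_tol times the
    largest eigenvalue are treated as zero.
    """
    u = unfold(tensor, mode)
    gram = u @ u.T  # symmetric positive semidefinite by construction
    eigvals = np.linalg.eigvalsh(gram)
    return int(np.sum(eigvals > rel_tol * eigvals.max()))


def rank_profile(tensor, noise_levels, mode=0, seed=0):
    """Map each noise level to the Gram-matrix rank of the noised tensor."""
    rng = np.random.default_rng(seed)
    profile = {}
    for sigma in noise_levels:
        # Simplification: noise is added to the tensor itself, not upstream.
        noisy = tensor + sigma * rng.standard_normal(tensor.shape)
        profile[sigma] = gram_rank(noisy, mode)
    return profile


# Toy stand-in for a latent tensor: a sum of two rank-one terms, shape (6, 8, 10).
rng = np.random.default_rng(0)
a, b, c = (rng.standard_normal((2, n)) for n in (6, 8, 10))
toy = np.einsum("ri,rj,rk->ijk", a, b, c)

profile = rank_profile(toy, noise_levels=[0.0, 0.1])
print(profile)
```

In this toy setting the rank along mode 0 is low without noise and grows once noise is added; how the rank changes across noise levels is the kind of signal the abstract refers to when it speaks of validating stratified structure.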
