Sparsification and Reconstruction from the Perspective of Representation Geometry

Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu

TL;DR

This work investigates how Sparse Autoencoders shape activation representations in language models from a representation-geometry perspective. It introduces SAEMA, a three-step framework that analyzes latent tensors under noise via rank-variability of symmetric semipositive definite (SSPD) matrices, revealing stratified manifold structure across concepts. By defining local/global representations and associated geometry metrics (Avg. ID, Betti 0, MSTW, AGD, Procrustes disparity), the study shows sparse encoding increases local dimensionality and substructure while compressing global structure, supporting feature disentanglement. An optimization-based intervention demonstrates that increasing separability among local representations causally improves SAE reconstruction (lower MSE), underscoring the practical value of geometric constraints for SAE design and interpretability. The findings provide a principled link between representational geometry and reconstruction fidelity, with implications for developing new interpretable tools and SAE-based editing of language-model activations.

Abstract

Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vectors from language models? What is the relationship between this organizational paradigm and feature disentanglement, as well as reconstruction performance? To address these questions, we propose SAEMA, which validates the stratified structure of the representation by observing how the rank of the symmetric semipositive definite (SSPD) matrix, obtained from the mode unfolding of the latent tensor, varies with the level of noise added to the residual stream. To systematically investigate how sparse encoding alters representational structures, we define local and global representations and show that sparse encoding amplifies inter-feature distinctions by merging similar semantic features and introducing additional dimensionality. Furthermore, we intervene on the global representation from an optimization perspective, demonstrating a significant causal relationship between the separability of the representations and reconstruction performance. This study explains the principles of sparsity from the perspective of representational geometry and demonstrates the impact of changes in representational structure on reconstruction performance. In particular, it emphasizes the necessity of understanding representations and incorporating representational constraints, providing empirical references for developing new interpretable tools and improving SAEs. The code is available at https://github.com/wenjie1835/SAERepGeo.
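
The rank-under-noise probe mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' SAEMA implementation: the toy tensor, the helper names (`mode_unfold`, `gram_rank`, `make_low_rank_latents`, `rank_vs_noise`), the tolerance, and the choice of mode are all assumptions made for illustration. The noise here is added directly to a synthetic latent tensor, whereas the paper adds noise to the residual stream before SAE encoding.

```python
import numpy as np


def mode_unfold(tensor, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def gram_rank(tensor, mode, rel_tol=1e-8):
    """Numerical rank of the Gram matrix of the mode-n unfolding."""
    unfolded = mode_unfold(tensor, mode)
    # The Gram matrix is symmetric and positive semidefinite by construction.
    gram = unfolded @ unfolded.T
    eigvals = np.linalg.eigvalsh(gram)
    # Count eigenvalues above a tolerance relative to the largest one
    # (rel_tol is an illustrative choice, not a value from the paper).
    return int(np.sum(eigvals > rel_tol * eigvals.max()))


def make_low_rank_latents(shape=(4, 5, 6), rank=2, seed=0):
    """Hypothetical toy tensor whose last-mode unfolding has rank `rank`."""
    rng = np.random.default_rng(seed)
    core = rng.standard_normal(shape[:-1] + (rank,))
    factors = rng.standard_normal((rank, shape[-1]))
    return core @ factors  # batched matmul, result has shape `shape`


def rank_vs_noise(latents, noise_levels, mode, seed=0):
    """Map each noise level to the Gram-matrix rank of the noisy tensor."""
    rng = np.random.default_rng(seed)
    ranks = {}
    for sigma in noise_levels:
        noisy = latents + sigma * rng.standard_normal(latents.shape)
        ranks[sigma] = gram_rank(noisy, mode)
    return ranks


if __name__ == "__main__":
    latents = make_low_rank_latents()
    print(rank_vs_noise(latents, [0.0, 0.01, 0.1], mode=2))
```

In this toy setting the rank stays low without noise and jumps once noise fills the remaining directions; how the rank changes across noise levels is the kind of signal the abstract describes as evidence of stratified structure.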

Paper Structure

This paper contains 46 sections, 24 equations, 5 figures, 8 tables, 3 algorithms.

Figures (5)

  • Figure 1: The workflow of SAEManifoldAnalyzer (SAEMA)
  • Figure 2: The changes of $r_{3}$ and AGD corresponding to different concepts encoded by pre-trained SAEs under different noise levels.
  • Figure 3: Variability of $d_{GW}$ and AEDP with increasing $\alpha$ during the optimization of Equation \ref{eq:gw_loss}.
  • Figure 4: Variability of $d_{GW}$ and AEDP with MSE for different $\alpha$ during the optimization of Equation \ref{eq:gw_loss}.
  • Figure 5: First column: Variability of MSE with AEDP as $\alpha$ increases during the optimization of Equation \ref{eq: AEDP_loss}; Second column: Contribution of the $AEDP^{-1}$ term to the Total Loss during the optimization of Equation \ref{eq: AEDP_loss}.