Sparse Crosscoders for diffing MoEs and Dense models

Marmik Chaudhari; Nishkal Hundia; Idhant Gulati

TL;DR

This work presents a systematic comparison of MoE and dense model internals using crosscoders, a variant of sparse autoencoders that jointly models multiple activation spaces.

Abstract

Mixture-of-Experts (MoE) models achieve parameter-efficient scaling through sparse expert routing, yet their internal representations remain poorly understood compared to those of dense models. We present a systematic comparison of MoE and dense model internals using crosscoders, a variant of sparse autoencoders that jointly models multiple activation spaces. We train 5-layer dense and MoE models with equal active parameters on 1B tokens spanning code, scientific text, and English stories. Using BatchTopK crosscoders with explicitly designated shared features, we achieve $\sim 87\%$ fractional variance explained and uncover concrete differences in feature organization. The MoE learns significantly fewer unique features than the dense model. MoE-specific features also exhibit higher activation density than shared features, whereas dense-specific features show lower density. Our analysis reveals that MoEs develop more specialized, focused representations, while dense models distribute information across broader, more general-purpose features.
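The reconstruction metric quoted above can be illustrated with a short sketch. It assumes the common definition of fractional variance explained (one minus the ratio of squared reconstruction error to the variance of the activations around their per-dimension mean); the abstract does not state the exact formula used, and the function name `fractional_variance_explained` is hypothetical.

```python
import numpy as np


def fractional_variance_explained(x, x_hat):
    """Return 1 - (squared reconstruction error) / (total variance of x).

    ASSUMPTION: this is the standard FVE definition, not one quoted
    from the paper. x and x_hat have shape (n_samples, d_model).
    """
    x = np.asarray(x, dtype=np.float64)
    x_hat = np.asarray(x_hat, dtype=np.float64)
    # Squared error between activations and their reconstruction.
    residual = np.sum((x - x_hat) ** 2)
    # Total variance of the activations around the per-dimension mean.
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - residual / total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(1000, 64))
    # A noisy copy stands in for a crosscoder reconstruction.
    recon = acts + 0.3 * rng.normal(size=acts.shape)
    print(f"FVE: {fractional_variance_explained(acts, recon):.3f}")
```

Under this definition a perfect reconstruction scores 1 and a reconstruction that only predicts the mean activation scores 0, so a value near 0.87 would mean that roughly 13% of the activation variance is left unexplained.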

Paper Structure (7 sections, 5 equations, 3 figures, 1 table)

This paper contains 7 sections, 5 equations, 3 figures, 1 table.

Introduction
Background
Mixture of Experts
Crosscoders
Methods
Results
Conclusion

Figures (3)

Figure 1: Fractional variance explained of model activations across 40k training steps
Figure 2: Relative difference of decoder vector norms for features in the two models. MoE-specific features are on the left $(<0.3)$ and dense-specific features on the right $(>0.7)$.
Figure 3: (Left) Cosine similarity of decoder vectors between the MoE and dense models. (Right) Feature densities of the shared, dense-specific, and MoE-specific features, where the x-axis shows the activation frequency of features and the y-axis shows the density of features.
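The quantities plotted in Figures 2 and 3 (Left) can be sketched as follows. This is a minimal illustration, not the authors' code: the captions do not give the formula, so the relative norm is assumed here to be the dense decoder norm divided by the sum of both norms, which is the orientation that places MoE-specific features below 0.3 and dense-specific features above 0.7. The function names and the `(n_features, d_model)` decoder layout are hypothetical.

```python
import numpy as np


def relative_decoder_norm(dec_moe, dec_dense, eps=1e-12):
    """Per-feature relative norm in [0, 1].

    ASSUMPTION: dense norm / (MoE norm + dense norm), inferred from the
    Figure 2 caption. Both decoders have shape (n_features, d_model).
    """
    norm_moe = np.linalg.norm(dec_moe, axis=1)
    norm_dense = np.linalg.norm(dec_dense, axis=1)
    return norm_dense / (norm_moe + norm_dense + eps)


def classify_features(rel_norm, low=0.3, high=0.7):
    """Label features using the thresholds quoted in the Figure 2 caption."""
    labels = np.full(rel_norm.shape, "shared", dtype=object)
    labels[rel_norm < low] = "moe_specific"
    labels[rel_norm > high] = "dense_specific"
    return labels


def decoder_cosine_similarity(dec_moe, dec_dense, eps=1e-12):
    """Cosine similarity between each feature's two decoder vectors."""
    dots = np.sum(dec_moe * dec_dense, axis=1)
    norms = np.linalg.norm(dec_moe, axis=1) * np.linalg.norm(dec_dense, axis=1)
    return dots / (norms + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dec_moe = rng.normal(size=(8, 16))
    # Rescale each feature's dense decoder vector to spread the relative norms.
    dec_dense = dec_moe * rng.uniform(0.1, 5.0, size=(8, 1))
    rel = relative_decoder_norm(dec_moe, dec_dense)
    print(classify_features(rel))
    print(decoder_cosine_similarity(dec_moe, dec_dense))
```

Cosine similarity is most informative for features labelled shared: a value near 1 suggests the feature points in corresponding directions in both models, whereas for model-specific features one of the two decoder vectors is close to zero and its direction carries little meaning.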
