Group Crosscoders for Mechanistic Analysis of Symmetry

Liv Gorton

TL;DR

Group crosscoders provide systematic insights into how neural networks represent symmetry, offering a promising new tool for mechanistic interpretability.

Abstract

We introduce group crosscoders, an extension of crosscoders that systematically discover and analyse symmetrical features in neural networks. While neural networks often develop equivariant representations without explicit architectural constraints, understanding these emergent symmetries has traditionally relied on manual analysis. Group crosscoders automate this process by performing dictionary learning across transformed versions of inputs under a symmetry group. Applied to InceptionV1's mixed3b layer using the dihedral group $\mathrm{D}_{32}$, our method yields two key insights. First, it naturally clusters features into interpretable families that correspond to previously hypothesised feature types, providing more precise separation than standard sparse autoencoders. Second, our transform block analysis enables the automatic characterisation of feature symmetries, revealing how different geometric features (such as curves versus lines) exhibit distinct patterns of invariance and equivariance. These results demonstrate that group crosscoders can provide systematic insights into how neural networks represent symmetry, offering a promising new tool for mechanistic interpretability.

Paper Structure

This paper contains 15 sections, 10 equations, 5 figures.

Figures (5)

  • Figure 1: A representation of how a dataset example for $\mathrm{D}_8$ would be structured. Each octagon represents one set of activations from the source model. Rotations rotate in a counter-clockwise direction, whereas the reflections rotate in a clockwise direction.
  • Figure 2: Using the rotated portion of $\mathrm{D}_8$ as an example, we can conceptualise the 1D vector as a circle.
  • Figure 3: A 4D UMAP of group crosscoder features using the precomputed distance matrix from the distance-matrix section, followed by a 2D UMAP with a cosine similarity metric. Distinct clusters of related features can be seen with structure emerging within some clusters, e.g., the "divots and corners" cluster.
  • Figure 4: The dictionary vectors of a regular sparse autoencoder trained on the entirety of mixed3b, following the methodology described in Gorton (2024). The UMAP procedure is the same as in Figure 3, except that cosine similarity is used as the metric for both the 4D and the 2D UMAPs.
  • Figure 5: The cosine similarity of each block within three different features. For each feature, the top left and bottom right correspond to rotations, and the top right and bottom left correspond to reflections.
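The block-wise comparison described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes (this is not confirmed by the text above) that each feature's dictionary vector is a concatenation of equally sized blocks, one per group element, with the rotation blocks stored first and the reflection blocks second. The function name `block_cosine_similarity`, the block sizes, and the toy vectors are all hypothetical and are not taken from the paper's code.

```python
import numpy as np

def block_cosine_similarity(dictionary_vector, n_blocks):
    """Cosine similarity between every pair of blocks of one dictionary vector.

    Assumption: the vector is a concatenation of `n_blocks` equally sized
    blocks, one per group element (rotations first, then reflections).
    """
    blocks = dictionary_vector.reshape(n_blocks, -1)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True)
    unit = blocks / np.clip(norms, 1e-12, None)  # guard against zero blocks
    return unit @ unit.T                         # shape (n_blocks, n_blocks)

rng = np.random.default_rng(0)
n_rot = 8                 # hypothetical number of rotations
n_blocks = 2 * n_rot      # rotations followed by reflections
block_dim = 16            # hypothetical activation width per block

# Toy "invariant" feature: the same direction in every block.
base = rng.normal(size=block_dim)
invariant = np.tile(base, n_blocks)
sim_invariant = block_cosine_similarity(invariant, n_blocks)

# Toy unstructured feature: independent random blocks.
generic = rng.normal(size=n_blocks * block_dim)
sim_generic = block_cosine_similarity(generic, n_blocks)

# Quadrants under the assumed ordering: the top-left and bottom-right
# compare blocks related by a rotation, the off-diagonal quadrants
# compare blocks related by a reflection.
rot_quadrant = sim_generic[:n_rot, :n_rot]
refl_quadrant = sim_generic[:n_rot, n_rot:]
```

Under these assumptions, a fully invariant feature gives a similarity matrix of all ones, while an equivariant feature would show structure that depends on the relative transformation between blocks.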