Towards a unified and verified understanding of group-operation networks

Wilson Wu, Louis Jaburi, Jacob Drori, Jason Gross

TL;DR

This work tackles the interpretability of one-hidden-layer networks trained to perform the binary operation on finite groups, addressing a gap in rigorous evaluation of mechanistic explanations. It reveals that trained models implement bi-equivariant circuitry organized around irreducible representations and $\rho$-sets, unifying prior accounts of coset concentration and irrep sparsity into a single, testable framework. By translating these mechanistic insights into compact proofs of model performance, the authors demonstrate nontrivial accuracy guarantees (and faster verification) for a majority of models, especially on the symmetric group $S_5$, while also highlighting cases where the explanation is incomplete. This approach provides a quantitative, scalable method to assess explanations, advances a cohesive theory of interpretability in algorithmic neural computations, and offers reproducible experiments and code for broader adoption.

Abstract

A recent line of work in mechanistic interpretability has focused on reverse-engineering the computation performed by neural networks trained on the binary operation of finite groups. We investigate the internals of one-hidden-layer neural networks trained on this task, revealing previously unidentified structure and producing a more complete description of such models in a step towards unifying the explanations of previous works (Chughtai et al., 2023; Stander et al., 2024). Notably, these models approximate equivariance in each input argument. We verify that our explanation applies to a large fraction of networks trained on this task by translating it into a compact proof of model performance, a quantitative evaluation of the extent to which we faithfully and concisely explain model internals. In the main text, we focus on the symmetric group $S_5$. For models trained on this group, our explanation yields a guarantee of model accuracy that runs 3x faster than brute force and gives a $\geq 95\%$ accuracy bound for 45% of the models we trained. We were unable to obtain nontrivial non-vacuous accuracy bounds using only explanations from previous works.

Paper Structure

This paper contains 57 sections, 9 theorems, 40 equations, 10 figures, 3 tables.

Key Result

Lemma 6.4

Let $H$ be a subgroup of $G$. The Fourier transform of a function constant on the cosets of $H$ is nonzero only at the irreducible components of the permutation representation corresponding to the action of $G$ on $G/H$.
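
The lemma can be sanity-checked numerically on a small case. The sketch below is an illustration, not the paper's code: it assumes $G = S_3$, $H = \{e, (0\,1)\}$, left cosets $gH$, and the unnormalized Fourier convention $\hat{f}(\rho) = \sum_{g} f(g)\rho(g)$, all of which are choices made here. The action of $S_3$ on $G/H$ is the natural action on three points, whose permutation representation splits into the trivial and standard irreps, so the transform should vanish at the sign irrep only. The coset values (2, -1, 5) and all helper names are hypothetical.

```python
from itertools import permutations
from math import sqrt

# Elements of S_3 as tuples p, where p[i] is the image of i.
G = list(permutations(range(3)))

def compose(p, q):
    """Return the permutation 'p after q' (apply q first, then p)."""
    return tuple(p[q[i]] for i in range(3))

def sign(p):
    """Sign of a permutation, computed from its inversion count."""
    inv = sum(1 for i in range(3) for j in range(i + 1, 3) if p[i] > p[j])
    return -1.0 if inv % 2 else 1.0

# Orthonormal basis of the plane orthogonal to (1, 1, 1); the standard
# irrep is the permutation action restricted to this plane.
B = [(1 / sqrt(2), -1 / sqrt(2), 0.0),
     (1 / sqrt(6), 1 / sqrt(6), -2 / sqrt(6))]

def standard(p):
    """2x2 matrix of the standard irrep of S_3 in the basis B."""
    def act(v):
        out = [0.0] * 3
        for i in range(3):
            out[p[i]] = v[i]  # permutation matrix sends e_i to e_{p[i]}
        return out
    return [[sum(a * b for a, b in zip(B[r], act(B[c]))) for c in range(2)]
            for r in range(2)]

irreps = {
    "trivial": lambda p: [[1.0]],
    "sign": lambda p: [[sign(p)]],
    "standard": standard,
}

# Assumed subgroup H = {e, (0 1)}; build f constant on each left coset gH.
H = [(0, 1, 2), (1, 0, 2)]
values = iter([2.0, -1.0, 5.0])  # arbitrary illustrative coset values
coset_value, f = {}, {}
for g in G:
    coset = frozenset(compose(g, h) for h in H)
    if coset not in coset_value:
        coset_value[coset] = next(values)
    f[g] = coset_value[coset]

def fourier(f, rho):
    """Unnormalized Fourier transform: sum over g of f(g) * rho(g)."""
    d = len(rho(G[0]))
    total = [[0.0] * d for _ in range(d)]
    for g in G:
        m = rho(g)
        for r in range(d):
            for c in range(d):
                total[r][c] += f[g] * m[r][c]
    return total

def frobenius(m):
    return sqrt(sum(x * x for row in m for x in row))

norms = {name: frobenius(fourier(f, rho)) for name, rho in irreps.items()}
for name, value in norms.items():
    print(f"{name}: {value:.4f}")
```

The sign coefficient cancels because every coset of $H$ contains one even and one odd permutation; the other two coefficients are nonzero for generic coset values.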

Figures (10)

  • Figure 1: Examples of $\rho$-sets extracted directly from the weights of models trained on the symmetric group $S_4$ (left, a tetrahedron) and the alternating group $A_5$ (right, an icosahedron). Both lie in ${\mathbb{R}}^3$. The vectors of the $\rho$-sets are depicted as points; the connecting edges are merely for illustration. See Section \ref{sec:prelims} for the definition of $\rho$-sets and Section \ref{sec:rhoset} for how they are used by models to compute the group operation. See Figure \ref{fig:S4} and Figure \ref{fig:A5} for compact proof bound results for $S_4$ and $A_5$, respectively. The standard irrep of $S_5$, the focus of the main text, is four-dimensional and hence more difficult to visualize.
  • Figure 2: Margin lower bound vs. logit distance upper bound over $x,y\in S_5$ for $V_{\mathrm{irrep}}$ and $V_{\mathrm{coset}}$ on a single example model. The accuracy lower bound is precisely the fraction of points for which the margin lower bound is larger than the logit distance upper bound (shaded region); in this example, the bound from $V_{\mathrm{irrep}}$ is 100% while that from $V_{\mathrm{coset}}$ is 0%. The margin lower bound of $V_{\mathrm{irrep}}$ is constant due to bi-equivariance.
  • Figure 3: Accuracy bound vs. computation time for $V_{\mathrm{irrep}}$ and $V_{\mathrm{brute}}$ on 100 models trained on $S_5$. Points in green ($V_{\mathrm{irrep}}$ unexpl) are models for which we find by inspection that our $\rho$-sets explanation does not hold, i.e. either \ref{cond:avar_bad} or \ref{cond:irrep_bad} occurs. Mean accuracy bound is 100% for $V_{\mathrm{brute}}$ (orange), 0% for $V_{\mathrm{coset}}$ (not shown), 50.4% for $V_{\mathrm{irrep}}$ (union of blue and green), and 91.7% for $V_{\mathrm{irrep}}$ when only including models for which neither \ref{cond:avar_bad} nor \ref{cond:irrep_bad} occurs (blue, 55% of total). Mean time elapsed is 2.20s for $V_{\mathrm{brute}}$ and 0.75s for $V_{\mathrm{irrep}}$. The asymptotic time complexity of $V_{\mathrm{brute}}$ is $O(m\lvert G\rvert^3)$ while that of $V_{\mathrm{irrep}}$ is $O(m\lvert G\rvert^2)$.
  • Figure 4: Plots of normalized variance $\mathbb{E}_i[\lVert{\bm{a}}_i-\mathbb{E}_i{\bm{a}}_i\rVert_2^2]/\mathbb{E}_i[\lVert{\bm{a}}_i\rVert_2^2]$ vs. model loss and weight norm, where ${\bm{a}}_i$ is the projection vector for neuron $i$, and expectation is taken across neurons within the 4d standard irrep of $S_5$. Each point is one model out of 100 trained on $S_5$. Notice that constant ${\bm{a}}_i$ across neurons is correlated with better model performance and lower weight norm.
  • Figure 5: Normalized distance between original and idealized model parameters ${\lVert{\bm{w}}-\hat{{\bm{w}}}\rVert_2^2/\lVert{\bm{w}}\rVert_2^2}$ (i.e. $1-R^2$) for each of left embedding ${\bm{w}}_l$, right embedding ${\bm{w}}_r$, unembedding ${\bm{w}}_u$, and unembed bias ${\bm{w}}_b$ of 100 models trained on $S_5$. Green boxes include all models while blue boxes exclude models for which we find that the $\rho$-set explanation does not hold (i.e. either \ref{cond:avar_bad} or \ref{cond:irrep_bad} occurs).
  • ...and 5 more figures
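
The summary quantities named in the captions of Figures 2, 4, and 5 are simple to compute once the underlying bounds and vectors are available. The sketch below uses hypothetical function names and toy inputs rather than the paper's code or data. It takes the Figure 2 criterion at face value (a point counts as certified when its margin lower bound exceeds its logit distance upper bound) and implements the two normalized ratios exactly as written in the captions.

```python
def accuracy_lower_bound(margin_lower, logit_dist_upper):
    """Fraction of input pairs whose margin lower bound exceeds the
    logit distance upper bound (the Figure 2 criterion)."""
    if len(margin_lower) != len(logit_dist_upper):
        raise ValueError("bounds must be given for the same input pairs")
    certified = sum(1 for m, d in zip(margin_lower, logit_dist_upper) if m > d)
    return certified / len(margin_lower)

def normalized_variance(vectors):
    """E_i ||a_i - E_i a_i||^2 / E_i ||a_i||^2 over a list of vectors
    (the Figure 4 quantity); 0 means all vectors are identical."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    spread = sum(sum((v[k] - mean[k]) ** 2 for k in range(dim)) for v in vectors) / n
    energy = sum(sum(x * x for x in v) for v in vectors) / n
    return spread / energy

def normalized_distance(w, w_hat):
    """||w - w_hat||^2 / ||w||^2 for flat parameter lists (Figure 5)."""
    num = sum((a - b) ** 2 for a, b in zip(w, w_hat))
    den = sum(a * a for a in w)
    return num / den

# Toy inputs, chosen only for illustration.
toy_margins = [1.0, 1.0, 1.0, 1.0]   # constant, as for a bi-equivariant bound
toy_dists = [0.2, 0.5, 1.3, 0.9]     # one pair fails the criterion
toy_bound = accuracy_lower_bound(toy_margins, toy_dists)
toy_variance = normalized_variance([(1.0, 0.0), (-1.0, 0.0)])
toy_distance = normalized_distance([3.0, 4.0], [3.0, 0.0])
print(toy_bound, toy_variance, toy_distance)
```

In the paper's setting the first function would be applied over all pairs $x,y\in S_5$, and the second over the projection vectors ${\bm{a}}_i$ of the neurons within one irrep.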

Theorems & Definitions (19)

  • Definition 6.1 (Chughtai et al., 2023)
  • Definition 6.2 (Stander et al., 2024)
  • Definition 6.3
  • Lemma 6.4 (Stander et al., 2024)
  • Proposition A.2
  • proof
  • Proposition B.3
  • proof
  • Lemma G.1
  • Lemma G.2
  • ...and 9 more