Towards a unified and verified understanding of group-operation networks
Wilson Wu, Louis Jaburi, Jacob Drori, Jason Gross
TL;DR
This work studies the interpretability of one-hidden-layer networks trained to perform the binary operation of a finite group, addressing a gap in the rigorous evaluation of mechanistic explanations. It shows that trained models implement bi-equivariant circuitry organized around irreducible representations and $\rho$-sets, unifying prior accounts of coset concentration and irrep sparsity into a single, testable framework. By translating these mechanistic insights into compact proofs of model performance, the authors obtain nontrivial accuracy guarantees (with verification faster than brute force) for a majority of models, especially on the symmetric group $S_5$, while also identifying cases where the explanation is incomplete. The approach provides a quantitative, scalable way to assess explanations, advances a cohesive theory of interpretability for algorithmic neural computation, and comes with reproducible experiments and code.
Abstract
A recent line of work in mechanistic interpretability has focused on reverse-engineering the computation performed by neural networks trained on the binary operation of finite groups. We investigate the internals of one-hidden-layer neural networks trained on this task, revealing previously unidentified structure and producing a more complete description of such models, a step towards unifying the explanations of previous works (Chughtai et al., 2023; Stander et al., 2024). Notably, these models approximate equivariance in each input argument. We verify that our explanation applies to a large fraction of networks trained on this task by translating it into a compact proof of model performance, which serves as a quantitative evaluation of how faithfully and concisely we explain model internals. In the main text, we focus on the symmetric group $S_5$. For models trained on this group, our explanation yields a guarantee on model accuracy whose verification runs 3x faster than brute force and gives a $\geq 95\%$ accuracy bound for 45% of the models we trained. We were unable to obtain non-vacuous accuracy bounds using only explanations from previous works.
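To make the task and the brute-force baseline concrete, the following is a minimal Python sketch, not the authors' code. It builds the multiplication table of $S_5$, scores an arbitrary predictor by exhaustively checking all $120^2$ input pairs, and measures how far a logit function is from being equivariant in each argument. All names (`TABLE`, `brute_force_accuracy`, `bi_equivariance_gap`, `oracle_logits`) are hypothetical. The precise form of equivariance tested here, logits at $(gx, yh)$ evaluated at $gzh$ matching logits at $(x, y)$ evaluated at $z$, is an assumed reading of "equivariance in each input argument".

```python
from itertools import permutations

# All 120 elements of S_5; each tuple p maps i -> p[i].
ELEMENTS = list(permutations(range(5)))
INDEX = {p: i for i, p in enumerate(ELEMENTS)}
N = len(ELEMENTS)


def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


# Cayley table: TABLE[i][j] is the index of ELEMENTS[i] * ELEMENTS[j].
TABLE = [[INDEX[compose(p, q)] for q in ELEMENTS] for p in ELEMENTS]


def brute_force_accuracy(predict):
    """Fraction of all N*N input pairs where predict(i, j) == TABLE[i][j]."""
    correct = sum(predict(i, j) == TABLE[i][j]
                  for i in range(N) for j in range(N))
    return correct / (N * N)


def bi_equivariance_gap(logits, g, h):
    """Max over x, y, z of |logits(g*x, y*h)[g*z*h] - logits(x, y)[z]|.

    `logits(i, j)` must return a length-N sequence of floats. A gap of 0
    means exact equivariance for this (g, h) under the assumed definition.
    """
    gap = 0.0
    for x in range(N):
        gx = TABLE[g][x]
        for y in range(N):
            base = logits(x, y)
            shifted = logits(gx, TABLE[y][h])
            for z in range(N):
                gzh = TABLE[TABLE[g][z]][h]
                gap = max(gap, abs(shifted[gzh] - base[z]))
    return gap


def oracle_logits(i, j):
    """Hypothetical stand-in for a trained model: one-hot on the true product."""
    out = [0.0] * N
    out[TABLE[i][j]] = 1.0
    return out
```

With a trained network, `predict` would return the argmax of the network's logits and `logits` would wrap its forward pass. The exhaustive loop in `brute_force_accuracy` is the baseline that a compact proof aims to beat in running time.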
