Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

Chun Hei Yip; Rajashree Agrawal; Lawrence Chan; Jason Gross

TL;DR

This work tackles the problem of compressing nonlinear feature-maps in mechanistic interpretability, focusing on the ReLU MLP within modular addition models. By applying an infinite-width lens, it reveals that the pizza transformer implements a numerical quadrature that doubles input frequencies, yielding a compact, frequency-based explanation for logits via an amplitude-phase Fourier form. The authors derive non-vacuous, linear-time bounds on MLP behavior, validate the integral interpretation empirically, and analyze the role of secondary frequencies to explain deviations from clock-like behavior. Together, these results demonstrate a concrete pathway to compress and reason about nonlinear components of transformer-like models, with potential implications for guarantees and anomaly detection in AI systems.

Abstract

The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps -- like MLP layers -- is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models, and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the "pizza" algorithm: the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at https://tinyurl.com/mod-add-integration.
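To make the quadrature picture concrete, the following is a minimal sketch, not the authors' code. It assumes one trigonometric identity of the kind the abstract describes, namely $\int_{-\pi}^{\pi} \mathrm{ReLU}(\cos(\theta + \varphi))\,\cos(\psi + 2\varphi)\,d\varphi = \tfrac{2}{3}\cos(\psi - 2\theta)$, which can be checked by direct integration; the exact identity and normalisation used in the paper may differ. The names `quadrature_sum`, `exact_integral`, `theta`, `psi`, and `n_neurons` are hypothetical and chosen for illustration only. Each loop iteration plays the role of one neuron contributing the area of one rectangle, and the result depends on `2 * theta`, which is the frequency doubling mentioned in the TL;DR.

```python
import math


def relu(x):
    """Standard ReLU nonlinearity."""
    return max(x, 0.0)


def quadrature_sum(theta, psi, n_neurons):
    """Midpoint Riemann sum over evenly spaced phases phi in [-pi, pi).

    Each term is one "neuron": a ReLU of a cosine at the input angle theta,
    weighted by a cosine at doubled phase, times the rectangle width.
    (Evenly spaced phases are an idealising assumption of this sketch.)
    """
    width = 2.0 * math.pi / n_neurons
    total = 0.0
    for i in range(n_neurons):
        phi = -math.pi + (i + 0.5) * width  # midpoint of the i-th rectangle
        total += relu(math.cos(theta + phi)) * math.cos(psi + 2.0 * phi) * width
    return total


def exact_integral(theta, psi):
    """Closed form of the integral that the sum approximates."""
    return (2.0 / 3.0) * math.cos(psi - 2.0 * theta)


if __name__ == "__main__":
    theta, psi = 0.7, 1.3  # arbitrary illustrative angles
    for n in (8, 64, 512):
        approx = quadrature_sum(theta, psi, n)
        err = abs(approx - exact_integral(theta, psi))
        print(f"n_neurons={n:4d}  approx={approx:+.6f}  abs_error={err:.2e}")
```

In this idealised setting the approximation error shrinks as the number of rectangles grows, which is the sense in which a finite-width MLP can be read as a numerical approximation of an infinite-width integral. A trained network's phases need not be evenly spaced, so its quadrature error has to be bounded separately.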
