Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration

Chun Hei Yip, Rajashree Agrawal, Lawrence Chan, Jason Gross

TL;DR

This work tackles the problem of compressing nonlinear feature-maps in mechanistic interpretability, focusing on the ReLU MLP within modular addition models. By applying an infinite-width lens, it reveals that the pizza transformer implements a numerical quadrature that doubles input frequencies, yielding a compact, frequency-based explanation for logits via an amplitude-phase Fourier form. The authors derive non-vacuous, linear-time bounds on MLP behavior, validate the integral interpretation empirically, and analyze the role of secondary frequencies to explain deviations from clock-like behavior. Together, these results demonstrate a concrete pathway to compress and reason about nonlinear components of transformer-like models, with potential implications for guarantees and anomaly detection in AI systems.

Abstract

The goal of mechanistic interpretability is discovering simpler, low-rank algorithms implemented by models. While we can compress activations into features, compressing nonlinear feature-maps, like MLP layers, is an open problem. In this work, we present the first case study in rigorously compressing nonlinear feature-maps, which are the leading asymptotic bottleneck to compressing small transformer models. We work in the classic setting of the modular addition models, and target a non-vacuous bound on the behaviour of the ReLU MLP in time linear in the parameter-count of the circuit. To study the ReLU MLP analytically, we use the infinite-width lens, which turns post-activation matrix multiplications into approximate integrals. We discover a novel interpretation of the MLP layer in one-layer transformers implementing the "pizza" algorithm: the MLP can be understood as evaluating a quadrature scheme, where each neuron computes the area of a rectangle under the curve of a trigonometric integral identity. Our code is available at https://tinyurl.com/mod-add-integration.
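
To make the infinite-width lens concrete, here is a minimal sketch of a sum over neurons converging to an integral. The abstract does not state the integrand, so the identity used here is an assumption chosen for illustration: $\int_{-\pi}^{\pi} \mathrm{ReLU}(\cos(\theta+\phi))\cos(2\phi+\gamma)\,\mathrm{d}\phi = \tfrac{2}{3}\cos(\gamma-2\theta)$, which can be checked by hand. The names `integrand`, `closed_form`, and `finite_width_sum` are hypothetical, and the equal-width rectangles stand in for the trained model's weights rather than reproducing them.

```python
import math


def integrand(phi, theta, gamma):
    """ReLU(cos(theta + phi)) * cos(2*phi + gamma).

    Assumed form for illustration; not taken from the paper's text.
    """
    return max(math.cos(theta + phi), 0.0) * math.cos(2.0 * phi + gamma)


def closed_form(theta, gamma):
    """Exact integral of `integrand` over phi in [-pi, pi]."""
    return (2.0 / 3.0) * math.cos(gamma - 2.0 * theta)


def finite_width_sum(num_neurons, theta, gamma):
    """Riemann sum with one equal-width rectangle per hypothetical neuron."""
    width = 2.0 * math.pi / num_neurons
    total = 0.0
    for i in range(num_neurons):
        # Centre of the i-th rectangle on [-pi, pi).
        phi_i = -math.pi + width * (i + 0.5)
        total += width * integrand(phi_i, theta, gamma)
    return total


if __name__ == "__main__":
    theta, gamma = 0.7, 1.9  # arbitrary illustrative phases
    exact = closed_form(theta, gamma)
    for n in (8, 64, 512, 4096):
        approx = finite_width_sum(n, theta, gamma)
        print(f"n={n:5d}  sum={approx:+.6f}  exact={exact:+.6f}  "
              f"abs_error={abs(approx - exact):.2e}")
```

In this sketch the phase in the second cosine is twice the phase inside the ReLU, so the closed form depends on $\gamma - 2\theta$; this is one way to see how a quadrature of this shape can double an input frequency.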

Paper Structure

This paper contains 34 sections, 48 equations, 39 figures, 3 tables.

Figures (39)

  • Figure 1: (Left) There are finitely many neurons in the model (indexed by $i$). The function $f_x(\xi_i)$ is ReLU applied to the inputs $x$. The weight of the connection to each output $c$ is $g_c(\xi_i)$ times a neuron-specific output-independent normalization factor $w_i$. (Right) Taking the limit as the number of neurons goes to infinity turns the sum over neurons into an integral. Compressing the resulting analytic expression allows us to compress the MLP.
  • Figure 2: The MLP approximately computes the integral $\int_{-\pi}^\pi h(\phi)\,\mathrm{d}\phi$. The computed integral is for frequency $k = 12$ when $a+b = c = 0$. The widths and heights of rectangles are generated by the actual weights in a trained model.
  • Figure 3: The input ($\phi_i$) and output ($\psi_i$) phase shift angles for frequency $k = 12$, where $\psi_i \approx 2\phi_i\pmod{2\pi}$ for the primary frequency of each neuron. The line has $R^2>0.99$, and the intervals between angles have mean width $0.054$ and standard deviation $0.049$. This shows that the angles are roughly uniform.
  • Figure 4: Histograms of the variance explained by the largest Fourier frequency component for the pre-activations and neuron-logit map $W_L$ for each of the 512 neurons in the mainline model.
  • Figure 5: We plot the error bound $\pm 2(\phi - \phi_i)$, depicted in red, for frequency $k = 12$. We observe that the red area includes both the actual curve and the numerical integration approximation. We use $\pm2$ as a bound on $h'(\phi)$ because the Lipschitz constant of $h$ is 2.
  • ...and 34 more figures
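
The rectangle picture in Figure 2 and the Lipschitz error bound in Figure 5 can be sketched together. The example is a rough illustration under stated assumptions, not the authors' implementation: it takes $h(\phi) = \mathrm{ReLU}(\cos\phi)\cos(2\phi)$ as the integrand for $a+b=c=0$ (this form is an assumption, though it does satisfy $|h'| \le 2$ and integrates to $2/3$ over one period), draws random angles instead of reading them from a trained model, and gives each angle the interval reaching halfway to its neighbours. The helper names `rectangles` and `quadrature_and_bound` are hypothetical.

```python
import math
import random

LIPSCHITZ = 2.0          # bound on |h'| quoted in the Figure 5 caption
EXACT_INTEGRAL = 2.0 / 3.0  # integral of the assumed h over one period


def h(phi):
    """Assumed integrand for a + b = c = 0 (illustrative only)."""
    return max(math.cos(phi), 0.0) * math.cos(2.0 * phi)


def rectangles(angles):
    """Give each angle the cell reaching halfway to its neighbours.

    Cells wrap around the circle, so together they cover one full period.
    Returns a list of (phi_i, left_edge, right_edge).
    """
    phis = sorted(angles)
    n = len(phis)
    cells = []
    for i, phi in enumerate(phis):
        prev_phi = phis[i - 1] - (2.0 * math.pi if i == 0 else 0.0)
        next_phi = phis[(i + 1) % n] + (2.0 * math.pi if i == n - 1 else 0.0)
        cells.append((phi, (prev_phi + phi) / 2.0, (phi + next_phi) / 2.0))
    return cells


def quadrature_and_bound(angles):
    """Return the rectangle-rule estimate and a Lipschitz error bound."""
    estimate = 0.0
    bound = 0.0
    for phi, left, right in rectangles(angles):
        estimate += (right - left) * h(phi)
        # Integral of LIPSCHITZ * |x - phi| over [left, right].
        bound += 0.5 * LIPSCHITZ * ((phi - left) ** 2 + (right - phi) ** 2)
    return estimate, bound


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(512)]
    estimate, bound = quadrature_and_bound(angles)
    print(f"estimate={estimate:.4f}  exact={EXACT_INTEGRAL:.4f}  "
          f"abs_error={abs(estimate - EXACT_INTEGRAL):.4f}  bound={bound:.4f}")
```

Because the bound only needs the sorted angles and one pass over the rectangles, it is cheap to evaluate. Whenever the Lipschitz assumption holds, the actual error cannot exceed it.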