Converting MLPs into Polynomials in Closed Form
Nora Belrose, Alice Rigg
TL;DR
This work develops an analytic framework for converting pretrained MLPs and GLUs into polynomial functions that globally minimize MSE under a maximum-entropy input model, yielding closed-form linear and quadratic approximants. On Gaussian-mixture approximations of MNIST, quadratic approximants explain more than $94$–$95\%$ of the variance in network outputs. This supports mechanistic interpretability via spectral decompositions and enables SVD-based adversarial attacks that transfer to the original networks. The study also reveals training-time dynamics consistent with the distributional simplicity bias: networks appear simpler early in training, and nonlinear (quadratic) structure dominates later. These results provide a mathematically grounded lens for understanding network representations and suggest extensions to transformers and FFN interpretability using polynomial bases and spectral methods.
Abstract
Recent work has shown that purely quadratic functions can replace MLPs in transformers with no significant loss in performance, while enabling new methods of interpretability based on linear algebra. In this work, we derive closed-form least-squares optimal approximations of feedforward networks (multilayer perceptrons and gated linear units) using polynomial functions of arbitrary degree. When the $R^2$ of an approximant is high, this allows us to interpret MLPs and GLUs by visualizing the eigendecomposition of the coefficients of their linear and quadratic approximants. We also show that these approximants can be used to create SVD-based adversarial examples. By tracing the $R^2$ of linear and quadratic approximants across training time, we find new evidence that networks start out simple and get progressively more complex. Even at the end of training, however, our quadratic approximants explain over 95% of the variance in network outputs.
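To make the idea of a closed-form least-squares approximant concrete, the following minimal sketch covers only the simplest case and is not the authors' implementation. It assumes a one-hidden-layer ReLU MLP and standard Gaussian inputs $x \sim \mathcal{N}(0, I)$ (a single Gaussian rather than a mixture); the function name `gaussian_linear_approximant` and all shapes are hypothetical. Under these assumptions, Stein's lemma gives the optimal slope as $W_2\,\mathrm{diag}(\Phi(b_1/s))\,W_1$, where $s$ holds the row norms of $W_1$, and the intercept is the mean output. The script then compares the analytic coefficients with a Monte Carlo least-squares fit and estimates the $R^2$ of the linear approximant.

```python
import math
import numpy as np

def gaussian_linear_approximant(W1, b1, W2, b2):
    """Affine least-squares fit to f(x) = W2 @ relu(W1 @ x + b1) + b2
    under the assumption x ~ N(0, I). Returns (alpha, beta)."""
    s = np.linalg.norm(W1, axis=1)            # std of each pre-activation
    t = b1 / s
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(t / math.sqrt(2.0)))  # Phi(t)
    pdf = np.exp(-0.5 * t**2) / math.sqrt(2.0 * math.pi)            # phi(t)
    # Stein's lemma: E[x * relu(w.x + b)] = w * P(w.x + b > 0) = w * Phi(b / s)
    beta = W2 @ (cdf[:, None] * W1)           # shape (d_out, d_in)
    # Mean of a rectified Gaussian: E[relu(z)] = b * Phi(b/s) + s * phi(b/s)
    alpha = W2 @ (b1 * cdf + s * pdf) + b2    # shape (d_out,)
    return alpha, beta

# Random toy network (illustrative sizes, not taken from the paper).
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 32, 4
W1 = rng.normal(size=(d_hidden, d_in)) / np.sqrt(d_in)
b1 = 0.1 * rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_out, d_hidden)) / np.sqrt(d_hidden)
b2 = 0.1 * rng.normal(size=d_out)

alpha, beta = gaussian_linear_approximant(W1, b1, W2, b2)

# Monte Carlo check: ordinary least squares on sampled inputs.
X = rng.normal(size=(100_000, d_in))
Y = np.maximum(X @ W1.T + b1, 0.0) @ W2.T + b2
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)  # shape (d_in + 1, d_out)
beta_mc, alpha_mc = coef[:-1].T, coef[-1]

# Fraction of output variance explained by the analytic linear approximant.
resid = Y - (X @ beta.T + alpha)
r2 = 1.0 - (resid**2).mean(axis=0).sum() / Y.var(axis=0).sum()
print("max |beta - beta_mc|:", np.abs(beta - beta_mc).max())
print("linear R^2 estimate:", r2)
```

The analytic coefficients should agree with the sampled fit up to Monte Carlo error. Extending this sketch to quadratic approximants or to Gaussian-mixture inputs requires higher-order Gaussian moments, which are not shown here.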
