
The Persian Rug: solving toy models of superposition using large-scale symmetries

Aditya Cowsik, Kfir Dolev, Alex Infanger

TL;DR

We give a complete mechanistic description of the algorithm learned by a minimal non-linear sparse data autoencoder in the limit of large input dimension, and show that the model is near-optimal among recently proposed architectures.

Abstract

We present a complete mechanistic description of the algorithm learned by a minimal non-linear sparse data autoencoder in the limit of large input dimension. The model, originally presented in arXiv:2209.10652, compresses sparse data vectors through a linear layer and decompresses using another linear layer followed by a ReLU activation. We notice that when the data is permutation symmetric (no input feature is privileged), large models reliably learn an algorithm that is sensitive to individual weights only through their large-scale statistics. For these models, the loss function becomes analytically tractable. Using this understanding, we give the explicit scalings of the loss at high sparsity, and show that the model is near-optimal among recently proposed architectures. In particular, changing or adding to the activation function any elementwise or filtering operation can at best improve the model's performance by a constant factor. Finally, we forward-engineer a model with the requisite symmetries and show that its loss precisely matches that of the trained models. Unlike the trained model weights, the low randomness in the artificial weights results in miraculous fractal structures resembling a Persian rug, to which the algorithm is oblivious. Our work contributes to neural network interpretability by introducing techniques for understanding the structure of autoencoders. Code to reproduce our results can be found at https://github.com/KfirD/PersianRug.
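To make the architecture in the abstract concrete, here is a minimal NumPy sketch of the forward pass: a linear compression from $n_s$ sparse features to $n_d$ dense dimensions, a linear decompression, a bias, and a ReLU. It is an illustration, not the authors' code. The names `sample_sparse_batch` and `autoencoder_forward`, the random (untrained) weights, the mean-squared-error loss, and the data distribution (each feature active independently with probability `p`, with active values uniform on $[0, 1)$) are all assumptions made for this example.

```python
import numpy as np

def sample_sparse_batch(rng, batch, n_s, p):
    """Assumed data model: each feature is active independently with
    probability p, and active values are uniform on [0, 1)."""
    mask = rng.random((batch, n_s)) < p
    values = rng.random((batch, n_s))
    return mask * values

def autoencoder_forward(x, w_in, w_out, b):
    """Compute ReLU(W_out W_in x + b) for each row of x, shape (batch, n_s)."""
    hidden = x @ w_in.T               # compress: (batch, n_d)
    pre_act = hidden @ w_out.T + b    # decompress: (batch, n_s)
    return np.maximum(pre_act, 0.0)   # elementwise ReLU

rng = np.random.default_rng(0)
n_s, n_d, p = 64, 16, 0.05            # illustrative sizes, not from the paper

# Random, untrained weights, used only to exercise the forward pass.
w_in = rng.normal(size=(n_d, n_s)) / np.sqrt(n_s)
w_out = rng.normal(size=(n_s, n_d)) / np.sqrt(n_d)
b = np.zeros(n_s)

x = sample_sparse_batch(rng, 128, n_s, p)
x_hat = autoencoder_forward(x, w_in, w_out, b)
loss = np.mean((x - x_hat) ** 2)      # assumed reconstruction loss
```

Only the product `w_out @ w_in` and the bias `b` enter the output, which is why the $n_s \times n_s$ matrix $W$ and the bias vector are the natural objects to inspect.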

Paper Structure

This paper contains 28 sections, 49 equations, 7 figures.

Figures (7)

  • Figure 1: The Persian rug, an artificial set of weights matching trained model performance.
  • Figure 2: Loss curves of trained models, Persian rug models, and optimal linear models as a function of the compression ratio.
  • Figure 3: Plot of the first $30\times 30$ elements of $W$ and the corresponding bias (${\mathbf{b}}$) components, at $p = 4.5\%$ and ratio $n_s = 512$. The diagonal components are all at similar values of $1.29 \pm 0.01$ (one standard deviation) while the off-diagonal components are approximately mean-zero, appearing like noise. The bias elements are all negative, around $-0.18 \pm 0.01$. This statistical uniformity is a permutation symmetry across the sparse features.
  • Figure 4: Permutation symmetry of diagonal values. We plot the mean-square fluctuation of the diagonal values corresponding to each model. Models are trained as a function of $p$ and $\frac{n_d}{n_s}$. The emergence of symmetry as $n_s$ grows (at all locations in the diagram) is a crucial element of the algorithm implemented by the autoencoders.
  • Figure 5: Permutation symmetry of bias values. We plot the mean-square fluctuation of values in the bias vectors corresponding to each model, which are trained as a function of $p$ and $\frac{n_d}{n_s}$. As $n_s$ increases the fluctuation over bias elements generally decreases in all trained models.
  • ...and 2 more figures
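
The symmetry statistics described in the captions of Figures 3 to 5 can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes $W$ is the product of the decompression and compression matrices, and it takes "mean-square fluctuation" to mean the variance of the diagonal (or bias) entries around their mean, which may differ from the paper's normalization. The function name `symmetry_diagnostics` is hypothetical.

```python
import numpy as np

def symmetry_diagnostics(w_in, w_out, b):
    """Summarize how uniform the diagonal of W = W_out W_in and the bias are.

    Assumption: fluctuation is measured as the variance around the mean.
    """
    w = w_out @ w_in                              # (n_s, n_s)
    diag = np.diag(w)
    off_diag = w[~np.eye(w.shape[0], dtype=bool)]  # all off-diagonal entries
    return {
        "diag_mean": float(diag.mean()),
        "diag_fluctuation": float(diag.var()),
        "offdiag_mean": float(off_diag.mean()),
        "bias_mean": float(b.mean()),
        "bias_fluctuation": float(b.var()),
    }

# A perfectly permutation-symmetric toy case (no compression, n_d = n_s),
# using the values quoted in the Figure 3 caption purely as placeholders.
n_s = 8
stats = symmetry_diagnostics(
    w_in=np.eye(n_s),
    w_out=1.29 * np.eye(n_s),
    b=np.full(n_s, -0.18),
)
```

In this toy case both fluctuation entries vanish. For a trained model, the captions suggest they should shrink as $n_s$ grows.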