Weight-based Decomposition: A Case for Bilinear MLPs
Michael T. Pearce, Thomas Dooms, Alice Rigg
TL;DR
This work addresses the interpretability of MLP components in transformers by focusing on bilinear MLPs, which drop the nonlinearity in the gate yet remain expressive and can be written exactly as a third-order tensor. The authors introduce a weight-based eigenvector decomposition that rewrites the bilinear computation as a set of sparsely interacting features and is fully equivalent to the original model. They show interpretable top eigenvectors on MNIST, give preliminary evidence of interpretable language-model features on Tiny Stories, and show that a pretrained model (TinyLlama-1.1B) can be finetuned into a bilinear variant with competitive loss. Regularization via latent noise improves interpretability and can improve generalization. The authors also note limitations, including polysemanticity and scalability. The approach offers a potential bridge between weights and interpretable features, with implications for mechanistic interpretability and model debugging.
Abstract
Gated Linear Units (GLUs) have become a common building block in modern foundation models. Bilinear layers drop the non-linearity in the "gate" but still have comparable performance to other GLUs. An attractive quality of bilinear layers is that they can be fully expressed in terms of a third-order tensor and linear operations. Leveraging this, we develop a method to decompose the bilinear tensor into a set of sparsely interacting eigenvectors that show promising interpretability properties in preliminary experiments for shallow image classifiers (MNIST) and small language models (Tiny Stories). Since the decomposition is fully equivalent to the model's original computations, bilinear layers may be an interpretability-friendly architecture that helps connect features to the model weights. Application of our method may not be limited to pretrained bilinear models since we find that language models such as TinyLlama-1.1B can be finetuned into bilinear variants.
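
To make the tensor view concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a bilinear layer of the form (Wx) ⊙ (Vx) with no bias, and a single output direction u that reads from the hidden activations. The names W, V, u, interaction_matrix, and eigen_features are illustrative and do not come from the paper. Under these assumptions, u · ((Wx) ⊙ (Vx)) is a quadratic form xᵀQx, and the eigendecomposition of the symmetric matrix Q reproduces the layer's output exactly.

```python
import numpy as np

def bilinear_layer(W, V, x):
    """Bilinear MLP activation: elementwise product of two linear maps, no gate nonlinearity."""
    return (W @ x) * (V @ x)

def interaction_matrix(W, V, u):
    """Symmetric matrix Q such that u . bilinear_layer(W, V, x) == x^T Q x."""
    # Contract the third-order tensor B[k, i, j] = W[k, i] * V[k, j] with u over k.
    Q = np.einsum("k,ki,kj->ij", u, W, V)
    # Only the symmetric part of Q contributes to a quadratic form.
    return 0.5 * (Q + Q.T)

def eigen_features(Q):
    """Eigenvalues and eigenvectors of Q, sorted by decreasing absolute eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(Q)
    order = np.argsort(-np.abs(eigvals))
    return eigvals[order], eigvecs[:, order]

# Hypothetical sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
u = rng.normal(size=d_hidden)   # assumed output direction
x = rng.normal(size=d_in)

Q = interaction_matrix(W, V, u)
eigvals, eigvecs = eigen_features(Q)

# Output computed directly from the layer.
direct = u @ bilinear_layer(W, V, x)
# Same output as a sum of squared projections onto the eigenvectors.
via_eigen = np.sum(eigvals * (eigvecs.T @ x) ** 2)
assert np.allclose(direct, via_eigen)
```

In this sketch each eigenvector contributes independently to the chosen output, weighted by its eigenvalue, so the eigenvectors with the largest absolute eigenvalues are natural candidates to inspect first. Whether those directions are interpretable in a given model is an empirical question.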
