Secret mixtures of experts inside your LLM

Enric Boix-Adsera

TL;DR

This work investigates why dense MLPs in transformers resemble sparse Mixture-of-Experts (MoE) computations. It develops a theoretical framework showing that Gaussian inputs hinder MoE approximation while dictionary-sparse activations enable efficient MoE representations, and validates this with distillation experiments on pretrained LLMs that reveal a secret MoE structure in activation distributions. The results demonstrate that structure in the activation distribution, which Gaussian data lacks, drives the MoE-like behavior, and show that MoEs with low-rank routers can match or exceed dense MLP performance with far fewer active parameters. The authors also advocate a distillation-based paradigm for rapid theory testing and architecture design, offering practical guidance for designing efficient MoE-ready components in future models.

Abstract

Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture due to its dense computation and lack of easy visualization. This paper seeks to understand the MLP layers in dense LLMs by hypothesizing that these layers secretly perform a computation that is approximately sparse -- namely, that they can be well approximated by sparsely-activating Mixture of Experts (MoE) layers. Our hypothesis is based on a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space. We empirically validate the hypothesis on pretrained LLMs, and demonstrate that the activation distribution matters -- these results do not hold for Gaussian data, but rather rely crucially on structure in the distribution of neural network activations. Our results shed light on a general principle at play in MLP layers inside LLMs, and give an explanation for the effectiveness of modern MoE-based transformers. Additionally, our experimental explorations suggest new directions for more efficient MoE architecture design based on low-rank routers.

Paper Structure

This paper contains 32 sections, 7 theorems, 36 equations, 12 figures, 1 table.

Key Result

Theorem 3.3

There are universal constants $c, c' > 0$ such that the following is true. Under the isotropic Gaussian input distribution $D = N(0, I_d/d)$, there is no $(m,k)$-MoE with a hard gating function, a number of active neurons $k d_{exp} < d/2$, and a number of expert configurations $m^k \leq \exp(cd)$ that can ...
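
To make the quantities in the theorem statement concrete, here is a minimal sketch of a forward pass through an $(m,k)$-MoE with hard top-$k$ gating. The paper's Definitions 2.1 and 2.2 are not reproduced on this page, so the architecture below is an assumption: a linear router, experts that are one-hidden-layer ReLU MLPs with $d_{exp}$ neurons each, and unweighted summation of the $k$ selected experts. All function names (`hard_topk_moe`, `random_matrix`, `matvec`) are hypothetical, and the code is plain Python to stay dependency-free.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def hard_topk_moe(x, router, experts, k):
    """Forward pass of a hypothetical (m, k)-MoE with hard top-k gating.

    router  : m x d matrix, one score row per expert (assumed linear router)
    experts : list of m (W_in, W_out) pairs; W_in is d_exp x d, W_out is d x d_exp
    Returns the output vector and the sorted indices of the active experts.
    """
    scores = matvec(router, x)
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    active = sorted(ranked[:k])
    out = [0.0] * len(experts[0][1])
    for i in active:
        W_in, W_out = experts[i]
        y = matvec(W_out, relu(matvec(W_in, x)))
        # Hard gating (assumed): selected experts are summed without weights.
        out = [o + yi for o, yi in zip(out, y)]
    return out, active

def random_matrix(rows, cols, rng, scale=1.0):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, m, k, d_exp = 16, 8, 2, 3  # toy sizes chosen for illustration only
router = random_matrix(m, d, rng)
experts = [(random_matrix(d_exp, d, rng), random_matrix(d, d_exp, rng))
           for _ in range(m)]

# One sample from the isotropic Gaussian N(0, I_d / d) used in the theorem.
x = [rng.gauss(0.0, (1.0 / d) ** 0.5) for _ in range(d)]
y, active = hard_topk_moe(x, router, experts, k)

active_neurons = k * d_exp  # the quantity the theorem compares against d / 2
print(active, active_neurons, active_neurons < d / 2)
```

In this toy configuration only $k d_{exp} = 6$ of the $m d_{exp} = 24$ hidden neurons are evaluated per input, which falls in the sparse regime $k d_{exp} < d/2$ that the theorem addresses.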

Figures (12)

  • Figure 1: In this paper we hypothesize, and then experimentally validate, that an MLP layer in the middle of a pretrained transformer can be effectively described by a sparsely-activating mixture-of-experts layer.
  • Figure 2: We distill the middle MLP layer of Pythia-410M to either a smaller MLP student model, or an MoE student model with fewer active parameters. On the left, we see that under the input distribution induced by the previous layers, MoE students can achieve the same distillation performance with fewer active parameters than MLPs. On the right, under a Gaussian input distribution with the same mean and covariance, MoE students yield no significant gain, showing that the data distribution is crucial for the secret MoE structure. See the experimental validation section and the additional experiments appendix for details and further experiments.
  • Figure 3: The dataset of internal activations is created by pushing forward datasets of text through all layers preceding the MLP that we seek to distill.
  • Figure 4: The unexplained fraction of the variance in the outputs from distilling the middle MLP layer of Pythia-70M (first row) and Gemma-3-270M (second row). Results for Pythia-410M are in Figure 2. In the left column, we observe that over the activation dataset ${\mathcal{D}}^{act}$ sparse MoE students are able to capture a significantly higher fraction of the variance than corresponding MLP students with the same number of active neurons. In particular, for Pythia-410M and Gemma-3-270M there are cases in which the sparse MoE captures the same variance as the MLP using 8 times fewer active neurons. On the other hand, the distillation results in the right column demonstrate that MoE students have little advantage when the data distribution is instead Gaussian (with matched mean and covariance).
  • Figure 5: A popular approach in deep learning theory (left) contrasted with our paper's approach (right) to understanding what a model is doing.
  • ...and 7 more figures
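
The captions for Figures 2 and 4 rely on two ingredients: an unexplained-variance score for a distilled student, and a Gaussian control distribution with the same mean and covariance as the real activations. The sketch below shows one way both could be computed. It assumes the common definition of unexplained variance (residual sum of squares divided by the total variance of the teacher outputs around their mean), which may differ in detail from the paper's metric. The function names and the toy data are hypothetical.

```python
import math
import random

def mean_vector(X):
    n = len(X)
    return [sum(col) / n for col in zip(*X)]

def covariance(X, mu):
    n, d = len(X), len(mu)
    C = [[0.0] * d for _ in range(d)]
    for x in X:
        c = [xi - mi for xi, mi in zip(x, mu)]
        for i in range(d):
            for j in range(d):
                C[i][j] += c[i] * c[j] / n
    return C

def cholesky(C, jitter=1e-9):
    """Lower-triangular L with L L^T = C + jitter * I."""
    d = len(C)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][t] * L[j][t] for t in range(j))
            if i == j:
                L[i][j] = math.sqrt(C[i][i] + jitter - s)
            else:
                L[i][j] = (C[i][j] - s) / L[j][j]
    return L

def sample_matched_gaussian(X, n_samples, rng):
    """Draw Gaussian samples sharing the empirical mean and covariance of X."""
    mu = mean_vector(X)
    L = cholesky(covariance(X, mu))
    d = len(mu)
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        samples.append([mu[i] + sum(L[i][t] * z[t] for t in range(i + 1))
                        for i in range(d)])
    return samples

def unexplained_variance_fraction(Y, Y_hat):
    """Residual sum of squares over total variance of the teacher outputs Y."""
    mu = mean_vector(Y)
    resid = sum((y - p) ** 2 for ys, ps in zip(Y, Y_hat) for y, p in zip(ys, ps))
    total = sum((y - m) ** 2 for ys in Y for y, m in zip(ys, mu))
    return resid / total

# Toy stand-in for activations: one noisy coordinate direction per sample.
rng = random.Random(0)
d, n = 3, 500
X = []
for _ in range(n):
    hot = rng.randrange(d)
    X.append([(1.0 if i == hot else 0.0) + rng.gauss(0.0, 0.05) for i in range(d)])

G = sample_matched_gaussian(X, 2000, rng)   # Gaussian control inputs
mu_X, mu_G = mean_vector(X), mean_vector(G)

mean_predictor = [mu_X for _ in X]          # a student that ignores its input
print(unexplained_variance_fraction(X, X))
print(unexplained_variance_fraction(X, mean_predictor))
```

Under this definition a perfect student scores 0 and a student that always outputs the mean scores 1, so lower values in the figures correspond to better distillation. The Gaussian samples `G` share first and second moments with `X` but lack its clustered structure, which is the contrast the right-hand panels are described as probing.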

Theorems & Definitions (30)

  • Definition 2.1
  • Definition 2.2
  • Definition 3.2
  • Theorem 3.3: Inapproximability of identity by sparse MoEs under Gaussian data distribution
  • Proof: Proof sketch
  • Definition 3.4: Dictionary-sparse structure
  • Definition 3.5: Approximately orthogonal dictionary
  • Theorem 3.6: Linear functions are approximable by sparse MoEs under sparse-dictionary data
  • proof
  • Remark 3.7: Making sense of the number of active parameters
  • ...and 20 more