Secret mixtures of experts inside your LLM

Enric Boix-Adsera

TL;DR

This work investigates why dense MLPs in transformers resemble sparse Mixture-of-Experts (MoE) computations. It develops a theoretical framework showing that Gaussian inputs hinder MoE approximation, whereas dictionary-sparse activations enable efficient MoE representations, and it validates this with distillation experiments on pretrained LLMs that reveal a secret MoE structure in activation distributions. The results indicate that structure in the activation distribution, rather than Gaussianity, drives the MoE-like behavior, and that MoE layers with low-rank routers can match or exceed dense MLP performance with far fewer active parameters. The authors also advocate a distillation-based paradigm for rapid theory testing and architecture design, offering practical guidance for designing efficient MoE-ready components in future models.

Abstract

Despite being one of the earliest neural network layers, the Multilayer Perceptron (MLP) is arguably one of the least understood parts of the transformer architecture, owing to its dense computation and lack of easy visualization. This paper seeks to understand the MLP layers in dense LLMs by hypothesizing that these layers secretly perform an approximately sparse computation; namely, that they can be well approximated by sparsely activating Mixture of Experts (MoE) layers. Our hypothesis is based on a novel theoretical connection between MoE models and Sparse Autoencoder (SAE) structure in activation space. We empirically validate the hypothesis on pretrained LLMs and demonstrate that the activation distribution matters: these results do not hold for Gaussian data, but rather rely crucially on structure in the distribution of neural network activations. Our results shed light on a general principle at play in MLP layers inside LLMs and give an explanation for the effectiveness of modern MoE-based transformers. Additionally, our experimental explorations suggest new directions for more efficient MoE architecture design based on low-rank routers.
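To make the comparison in the abstract concrete, the following is a minimal NumPy sketch of a dense ReLU MLP next to a top-k MoE layer whose router is low-rank. It is illustrative only and is not the paper's implementation: the function names, the toy dimensions, the softmax gating over the selected experts, and the random (untrained, undistilled) weights are all assumptions made for this example.

```python
import numpy as np

def dense_mlp(x, w_in, w_out):
    """Dense two-layer ReLU MLP: every hidden unit is used for every input."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def moe_layer(x, router_a, router_b, experts, k):
    """Top-k MoE layer: each input row is sent to its k highest-scoring experts.

    The router is low-rank (an assumption for this sketch): its scores factor
    through an r-dimensional bottleneck, scores = (x @ router_a) @ router_b.
    """
    scores = (x @ router_a) @ router_b              # (batch, n_experts)
    top = np.argsort(-scores, axis=1)[:, :k]        # indices of the k largest scores
    top_scores = np.take_along_axis(scores, top, axis=1)
    # Softmax over the selected experts only (hypothetical gating choice).
    gates = np.exp(top_scores - top_scores.max(axis=1, keepdims=True))
    gates /= gates.sum(axis=1, keepdims=True)
    out = np.zeros((x.shape[0], experts[0][1].shape[1]))
    for i in range(x.shape[0]):
        for j in range(k):
            w_in, w_out = experts[top[i, j]]
            out[i] += gates[i, j] * dense_mlp(x[i:i + 1], w_in, w_out)[0]
    return out, top

# Toy sizes, chosen only for illustration.
d, hidden, n_experts, expert_hidden, r, k = 16, 64, 8, 8, 2, 2
rng = np.random.default_rng(0)

x = rng.normal(size=(4, d))
router_a = rng.normal(size=(d, r))
router_b = rng.normal(size=(r, n_experts))
experts = [
    (rng.normal(size=(d, expert_hidden)), rng.normal(size=(expert_hidden, d)))
    for _ in range(n_experts)
]

out, top = moe_layer(x, router_a, router_b, experts, k)

# Parameters touched per input: dense MLP versus k experts plus the router.
dense_params = 2 * d * hidden
active_params = k * (2 * d * expert_hidden) + d * r + r * n_experts
```

In this toy configuration the MoE layer touches 560 parameters per input against 2048 for the dense MLP. Whether such a layer can actually approximate a given dense MLP depends on training it against that MLP's outputs on real activations, which this sketch does not attempt.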
