Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget

Peter Balogh

TL;DR

The need for MLP nonlinearity cannot be predicted from token identity: cross-corpus correlation is effectively zero ($r<0.05$). A gate with $d+1$ parameters decides when to replace the full MLP with a linear surrogate.

Abstract

We investigate when transformer MLP nonlinearity is actually necessary. A gate with $d+1$ parameters decides when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M-2.8B parameters), two architectures, and three corpora, we establish that the need for nonlinearity cannot be predicted from token identity: cross-corpus correlation is effectively zero ($r < 0.05$). The routing decision is fully contextual. Despite weak per-instance predictability, the gate exploits a heavily skewed distribution in which most MLP computations are near-linear, achieving 25-56% linear routing at <1% perplexity cost in GPT-2. In GPT-2 Large, 11 of 36 layers beat baseline with gating, and no layer exceeds 3.7% all-linear cost. This success is architecture-dependent: Pythia models show higher costs, though Pythia-2.8B's full 32-layer sweep reveals one layer that narrowly beats baseline. As a proof of concept, we progressively replace middle-layer MLPs with frozen linear matrices: 5 of 24 layers linearize at zero cost. With a full training budget, 4 linearized layers yield a 10.2% perplexity improvement, and a two-phase gated approach pushes this to 17.3%, beating a vanilla fine-tuning control and confirming that the nonlinear MLPs at these layers were actively harmful.
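To make the routing idea concrete, the following is a minimal NumPy sketch of a per-token gate that chooses between a full MLP and a linear surrogate. It is an illustration under stated assumptions, not the paper's implementation: the logistic form of the gate, the `threshold` parameter, the least-squares fit of the surrogate, and all function names are hypothetical. The only detail taken from the abstract is the parameter count, $d$ weights plus one bias.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    # Standard two-layer transformer MLP: d -> hidden -> d
    return gelu(x @ W1 + b1) @ W2 + b2

def fit_linear_surrogate(X, Y):
    # Assumption: the surrogate is an affine map fit by least squares, Y ~ X @ A + c
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return coef[:-1], coef[-1]

def gated_mlp(x, gate_w, gate_b, threshold, mlp_params, A, c):
    # Assumption: logistic gate on the MLP input; gate_w (d values) + gate_b = d+1 parameters
    score = 1.0 / (1.0 + np.exp(-(x @ gate_w + gate_b)))
    use_full = score > threshold
    out = x @ A + c                      # cheap linear path for every token
    if use_full.any():
        out[use_full] = mlp(x[use_full], *mlp_params)  # full MLP only where the gate fires
    return out, use_full

# Usage on random data (shapes only; no trained weights are implied)
rng = np.random.default_rng(0)
d, hidden, n = 8, 32, 64
params = (rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden),
          rng.normal(size=(hidden, d)) * 0.1, np.zeros(d))
X = rng.normal(size=(n, d))
A, c = fit_linear_surrogate(X, mlp(X, *params))
out, routed_full = gated_mlp(X, rng.normal(size=d), 0.0, 0.5, params, A, c)
linear_fraction = 1.0 - routed_full.mean()  # share of tokens sent down the linear path
```

In this sketch the fraction of tokens on the linear path corresponds to the "linear routing" percentage quoted in the abstract; how the gate is actually trained is not specified here.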

Paper Structure (45 sections, 7 equations, 3 figures, 11 tables)

Figures (3)

  • Figure 1: All-linear replacement cost vs. best-gate cost by layer (GPT-2 Medium). Red circles mark layers where gating improves perplexity over baseline. The gate reduces cost dramatically at every layer, and the middle layers (L8--L18) are nearly free to linearize.
  • Figure 2: Mean MLP contribution ($\delta$) by token type and layer depth (GPT-2 Medium). All categories show increasing nonlinearity demand at later layers, but within-category variance (not shown) is large, confirming that token type is a weak predictor.
  • Figure 3: Median all-linear perplexity cost vs. model size. Within Pythia, linearization cost decreases monotonically with scale. Both GPT-2 models (Medium with all 23 layers, Large with all 36 layers) are substantially cheaper than Pythia at comparable size, reflecting the architectural divide. Pythia-2.8B uses all 32 layers; other Pythia models use sampled layers.
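The captions above report linearization results as a perplexity cost relative to baseline. A small sketch of one way such a figure could be computed is given below, assuming the cost is the relative change in perplexity, expressed in percent, between a modified model and the unmodified baseline; this definition and the function names are assumptions for illustration, not taken from the paper.

```python
import math

def perplexity(mean_nll):
    # Perplexity is the exponential of the mean per-token negative log-likelihood (in nats)
    return math.exp(mean_nll)

def perplexity_cost_pct(base_nll, variant_nll):
    # Assumed definition: relative perplexity change in percent.
    # Positive values mean the variant is worse; negative values mean it beats baseline.
    return 100.0 * (perplexity(variant_nll) / perplexity(base_nll) - 1.0)
```

Under this reading, a layer that "beats baseline" has a negative cost, and a "<1% perplexity cost" means the modified model's perplexity is within 1% of the baseline's.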