The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

Peter Balogh

TL;DR

We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.

Abstract

We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture -- seven "default-ON" neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive -- creating a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms that the routing is functional: removing the MLP at consensus breakdown increases perplexity by 43.3%, while removing it at full consensus increases perplexity by only 10.1%, a more than 4x difference. Comparing binary vs. continuous features for the routing decision confirms that binarization loses essentially no information about the decision itself (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22). This binary routing structure explains why smooth polynomial approximation fails: cross-validated polynomial fits (degrees 2-7) never exceed R^2 = 0.06 for highly nonlinear layers. We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.
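To make the binarization and mutual-exclusivity ideas in the abstract concrete, here is a minimal NumPy sketch. It is not the paper's code. The threshold of 0, the helper names `binarize` and `exclusivity`, and the definition of exclusivity as "exactly one of the two neurons is active" are all assumptions made for illustration, and the activations are toy values rather than GPT-2 data.

```python
import numpy as np

def binarize(acts, threshold=0.0):
    """Map continuous activations to {0, 1}: 1 where the value exceeds threshold.

    The threshold of 0.0 is an assumption, not a value taken from the paper.
    """
    return (np.asarray(acts) > threshold).astype(np.int8)

def exclusivity(a_bin, b_bin):
    """Fraction of tokens on which exactly one of the two neurons is active.

    This is one possible reading of "mutually exclusive" (assumed here).
    """
    return float(np.mean(np.asarray(a_bin) != np.asarray(b_bin)))

# Toy activations: 4 tokens x 3 hypothetical "default-ON" neurons.
acts_consensus = np.array([
    [1.2, 0.8, 0.5],
    [0.9, 1.1, 0.3],
    [-0.2, -0.1, 0.4],
    [-0.5, -0.3, -0.1],
])
# Toy activations of a single hypothetical exception-handler neuron.
acts_exception = np.array([-0.3, -0.1, 0.2, 1.5])

cons = binarize(acts_consensus)      # shape (tokens, neurons)
exc = binarize(acts_exception)       # shape (tokens,)
consensus_count = cons.sum(axis=1)   # active consensus neurons per token

excl_first = exclusivity(cons[:, 0], exc)
excl_last = exclusivity(cons[:, 2], exc)
```

On real data, `acts_consensus` and `acts_exception` would be replaced by recorded MLP hidden activations for the neurons of interest; how those are extracted is left out of this sketch.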

Paper Structure (33 sections, 1 equation, 3 figures, 12 tables)

Introduction
The Smooth Function Framing
The Kolmogorov Connection
An Analogy: Shannon's Switch
Method
Polynomial Probing
Branch Detection
Binary Feature Extraction
Results I: Polynomials Fail Categorically
Results II: Binary Routing Structure
From Activation Properties to Learned Structure
Neuron Forensics
Neuron 2123: The Exception Handler
Two Regimes: The Consensus Gradient
Binary Patterns as Pseudocode
...and 18 more sections

Figures (3)

Figure 1: Two Regimes. N2123 fire rate (red, left axis) and MLP output norm (blue, right axis) as a function of default-ON consensus neuron count. The gradient is perfectly monotonic: consensus breakdown triggers the exception handler and a 2.8$\times$ larger output norm. 500K WikiText-103 tokens, GPT-2 Small Layer 11.
Figure 2: Exception handler architecture emerging from learned weights in Layer 11. Seven default-ON consensus neurons and N2123 are 93--98% mutually exclusive. When consensus holds, the MLP operates near-linearly (norm $\approx$ 70); when it breaks down, N2123 fires and triggers full nonlinear computation (norm $\approx$ 194).
Figure 3: Extracted binary logic from Layer 11 MLP (simplified). Pattern enrichment validated at 500K tokens using the nonlinearity delta (least-squares residual) metric. Full pattern table in supplementary material.
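The quantity plotted in Figure 1 (exception-handler fire rate and MLP output norm as a function of consensus count) could be tabulated along the following lines. This is a sketch under stated assumptions: the function name `rate_by_consensus` is hypothetical, the inputs are assumed to be already-binarized activations plus per-token output norms, and the arrays below are toy values, not measurements from the paper.

```python
import numpy as np

def rate_by_consensus(consensus_on, exception_on, out_norm):
    """Group tokens by number of active consensus neurons.

    Returns {count: (exception fire rate, mean MLP output norm)}.
    consensus_on: (tokens, neurons) binary array
    exception_on: (tokens,) binary array
    out_norm:     (tokens,) float array
    """
    consensus_on = np.asarray(consensus_on, dtype=bool)
    exception_on = np.asarray(exception_on, dtype=bool)
    out_norm = np.asarray(out_norm, dtype=float)

    counts = consensus_on.sum(axis=1)
    summary = {}
    for k in np.unique(counts):
        mask = counts == k
        summary[int(k)] = (
            float(exception_on[mask].mean()),  # fraction of tokens where it fires
            float(out_norm[mask].mean()),      # average output norm in this group
        )
    return summary

# Toy example: 4 tokens, 2 hypothetical consensus neurons.
consensus_on = np.array([[1, 1], [1, 1], [1, 0], [0, 0]])
exception_on = np.array([0, 0, 1, 1])
out_norm = np.array([1.0, 2.0, 4.0, 8.0])

summary = rate_by_consensus(consensus_on, exception_on, out_norm)
```

Plotting the two tuple entries against the dictionary keys gives the kind of two-axis curve the caption describes; whether it is monotonic depends entirely on the data supplied.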
