
Darkness Visible: Reading the Exception Handler of a Language Model

Peter Balogh

Abstract

The final MLP of GPT-2 Small exhibits a fully legible routing program -- 27 named neurons organized into a three-tier exception handler -- while the knowledge it routes remains entangled across ~3,040 residual neurons. We decompose all 3,072 neurons (to numerical precision) into: 5 fused Core neurons that reset vocabulary toward function words, 10 Differentiators that suppress wrong candidates, 5 Specialists that detect structural boundaries, and 7 Consensus neurons that each monitor a distinct linguistic dimension; the remaining ~3,040 neurons form the entangled residual. The consensus-exception crossover -- where MLP intervention shifts from helpful to harmful -- is statistically sharp (bootstrap 95% CIs exclude zero at all consensus levels; crossover between 4/7 and 5/7). Three experiments show that "knowledge neurons" (Dai et al., 2022), at L11 of this model, function as routing infrastructure rather than fact storage: the MLP amplifies or suppresses signals already present in the residual stream from attention, scaling with contextual constraint. A garden-path experiment reveals a reversed garden-path effect -- GPT-2 uses verb subcategorization immediately, consistent with the exception handler operating at token-level predictability rather than syntactic structure. This architecture crystallizes only at the terminal layer -- in deeper models, we predict equivalent structure at the final layer, not at layer 11. Code and data: https://github.com/pbalogh/transparent-gpt2
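The crossover result quoted above rests on estimating the L11 MLP's effect separately at each consensus level and bootstrapping 95% CIs on those effects. The following is a minimal sketch of that kind of measurement, not the paper's released pipeline: it assumes TransformerLens hook names, zero-ablation of blocks.11.hook_mlp_out as the intervention, log-prob of the correct next token as the metric, and placeholder prompts standing in for the real consensus-labeled dataset.

import numpy as np
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small, 12 layers

def next_token_logprob(prompt: str, target: str, ablate_mlp11: bool) -> float:
    """Log-prob of `target` as the next token, optionally with the L11 MLP output zeroed."""
    hooks = []
    if ablate_mlp11:
        hooks = [("blocks.11.hook_mlp_out", lambda acts, hook: torch.zeros_like(acts))]
    logits = model.run_with_hooks(model.to_tokens(prompt), fwd_hooks=hooks)
    logprobs = logits[0, -1].log_softmax(dim=-1)
    return logprobs[model.to_single_token(target)].item()

def bootstrap_ci(deltas, n_boot=10_000, seed=0):
    """Percentile bootstrap 95% CI on the mean per-example MLP effect."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(deltas)
    means = [rng.choice(deltas, size=len(deltas), replace=True).mean() for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

# Placeholder (prompt, correct-next-token) pairs, keyed by hypothetical consensus level;
# the real analysis uses many prompts per level, scored on the seven consensus dimensions.
examples_by_level = {
    4: [("The capital of France is", " Paris"), ("She opened the front", " door")],
    5: [("I picked up the", " phone"), ("He turned off the", " light")],
}
for level, pairs in examples_by_level.items():
    deltas = [next_token_logprob(p, t, False) - next_token_logprob(p, t, True) for p, t in pairs]
    lo, hi = bootstrap_ci(deltas)
    print(f"consensus {level}/7: mean delta log-prob {np.mean(deltas):+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")

A positive delta means the MLP helps (removing it lowers the correct token's log-prob); under the paper's result, the sign should flip somewhere between 4/7 and 5/7 on the real consensus-labeled data.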


Figures (4)

  • Figure 1: The exception handler as pseudocode. The pseudocode represents the functional organization observed in activation patterns, not a literal causal program: N2123 reliably indicates which path is active but does not control it (Balogh, 2026).
  • Figure 2: L11 MLP effect by consensus level. Blue: MLP helps. Red: MLP hurts. Crossover between 4/7 and 5/7 matches the causal analysis of Balogh (2026).
  • Figure 3: The DC offset paradox. Left: Exception-path output norm. Right: PPL impact. The Core dominates at 54% of norm but contributes only +0.2% PPL -- a DC offset the residual stream already provides.
  • Figure 4: Top-1 accuracy across layers. The developmental arc holds under both logit lens (dashed) and tuned lens (solid); a minimal logit-lens sketch follows this list.
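As a rough illustration of the logit-lens curve in Figure 4 (the dashed line; the tuned lens requires separately trained per-layer translators and is not sketched here), one can decode each layer's post-block residual stream through the final LayerNorm and unembedding and ask whether the correct token is already top-1. This is a hedged sketch with a placeholder prompt and target, not the released evaluation code.

import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small, 12 layers

def layerwise_top1(prompt: str, target: str) -> list[bool]:
    """Logit lens: for each layer, project the post-block residual stream through
    ln_final and the unembedding, and check if `target` is top-1 at the last position."""
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)
    target_id = model.to_single_token(target)
    hits = []
    for layer in range(model.cfg.n_layers):
        resid = cache["resid_post", layer]              # [batch, pos, d_model]
        logits = model.unembed(model.ln_final(resid))   # decode into vocabulary space
        top1 = logits[0, -1].argmax().item()            # prediction at the last position
        hits.append(top1 == target_id)
    return hits

print(layerwise_top1("The Eiffel Tower is in the city of", " Paris"))

Averaged over a corpus, the per-layer fraction of True values gives a top-1 accuracy curve like the dashed line in Figure 4.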