PolyGLU: State-Conditional Activation Routing in Transformer Feed-Forward Networks

Daniel Nobrega Medeiros

Abstract

Biological neural systems employ diverse neurotransmitters -- glutamate, GABA, dopamine, acetylcholine -- to implement distinct signal-processing modalities within shared neural circuits. In contrast, modern transformers apply a single fixed activation function across all feed-forward neurons. We introduce PolyGLU (Polychromatic Gated Linear Unit), a drop-in replacement for SwiGLU that enables each FFN neuron to dynamically route among K=4 activation functions via a differentiable mechanism combining learned static preferences with input-conditioned gating, trained end-to-end with Gumbel-Softmax. We train PolychromaticLM, a 597M-parameter transformer, on ~10B tokens using a single NVIDIA A100 GPU. Our key finding is emergent routing behavior: without any explicit sparsity loss or entropy regularization, the routing mechanism converges to near-deterministic activation selections (mean dynamic entropy = 0.030% of maximum), with a striking depth-dependent specialization pattern -- early layers prefer GELU while deep layers strongly favor Tanh. Three layers maintain elevated routing entropy, suggesting computational flexibility points. The routing architecture adds only 0.23% parameter overhead (~1.4M parameters) and proves fully robust to supervised fine-tuning: routing entropy remains constant at ln(4) throughout 13,067 SFT steps. On standard benchmarks, PolychromaticLM achieves 62-89% of Qwen3-0.6B-Base performance despite training on 3,600x fewer tokens. All code, weights, and training infrastructure are released under Apache 2.0.
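The routing idea described in the abstract can be illustrated with a minimal single-neuron sketch. This is not the released implementation: the candidate set `(ReLU, Tanh, SiLU, GELU)`, the additive combination of static and input-conditioned logits, and all names (`polyglu_neuron`, `static_logits`, `dynamic_logits`) are assumptions made for illustration. The abstract confirms only K=4, the presence of GELU and Tanh, and Gumbel-Softmax training.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def silu(z):
    # SiLU / swish, the activation used inside SwiGLU.
    return z / (1.0 + math.exp(-z))


def gelu(z):
    # Exact (erf-based) GELU.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


# K = 4 candidate activations. Only GELU and Tanh are named in the abstract;
# ReLU and SiLU are assumed here to complete the set.
ACTIVATIONS = (relu, math.tanh, silu, gelu)


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def polyglu_neuron(gate_pre, up_pre, static_logits, dynamic_logits, tau, rng):
    """One FFN neuron: mix K activations of the gate path, then gate the up path.

    gate_pre / up_pre are the neuron's two pre-activations (as in SwiGLU).
    static_logits are learned per-neuron preferences; dynamic_logits stand in
    for the input-conditioned gating term (hypothetical decomposition).
    """
    logits = [s + d for s, d in zip(static_logits, dynamic_logits)]
    weights = gumbel_softmax(logits, tau, rng)
    mixed = sum(w * act(gate_pre) for w, act in zip(weights, ACTIVATIONS))
    return mixed * up_pre, weights


if __name__ == "__main__":
    rng = random.Random(0)
    out, weights = polyglu_neuron(
        gate_pre=0.5,
        up_pre=2.0,
        static_logits=[0.0, 0.0, 0.0, 0.0],
        dynamic_logits=[0.0, 0.0, 0.0, 0.0],
        tau=1.0,
        rng=rng,
    )
    print(out, weights)
```

In this sketch, a low temperature `tau` or a large logit gap drives `weights` toward a one-hot vector, so the neuron applies essentially a single activation. That is the near-deterministic regime the abstract reports emerging without an explicit sparsity or entropy penalty.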

Paper Structure (50 sections, 8 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: Pre-training loss curve. Loss decreases from 12.13 to 1.31 over $\sim$10.24B tokens (19,531 steps). A mid-training intervention at step 10,000 (Section \ref{sec:weight_decay}) introduced no visible discontinuity.
  • Figure 2: Combined training dynamics showing loss, learning rate, Gumbel-Softmax temperature ($\tau$), and throughput over the full training run.
  • Figure 3: Per-layer dynamic routing entropy at convergence (step 19,531). Most layers achieve entropy $< 10^{-4}$, with three notable exceptions: layers 9, 16, and 17 maintain elevated entropy, suggesting computational flexibility points.
  • Figure 4: Evolution of dynamic routing entropy during training. Most layers converge to near-zero entropy, while layers 9, 16, and 17 maintain elevated values. Layer 17 notably increases its entropy in the final phase.
  • Figure 5: Neurotransmitter heatmap showing the preferred activation function per neuron (columns) across layers (rows). A clear depth-dependent specialization emerges: early layers favor GELU (blue), while deep layers strongly prefer Tanh (orange).
  • ...and 8 more figures
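
The entropy values in the captions above, and the "percent of maximum" figure in the abstract, can be read against the upper bound $\ln K = \ln 4 \approx 1.386$ for a K=4 routing distribution. The following is a minimal sketch of that normalization for a single routing distribution. The function name `normalized_entropy` is hypothetical, and how the paper averages over tokens, neurons, and layers is not stated here, so that step is left out.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of one routing distribution, as a fraction of ln(K)."""
    k = len(probs)
    # Terms with p == 0 contribute nothing (the limit of p * ln p is 0).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(k)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]      # maximally uncertain routing
    one_hot = [0.0, 1.0, 0.0, 0.0]          # fully deterministic routing
    nearly = [1e-6, 1.0 - 3e-6, 1e-6, 1e-6]  # near-deterministic routing
    for name, dist in (("uniform", uniform), ("one_hot", one_hot), ("nearly", nearly)):
        print(name, normalized_entropy(dist))
```

A uniform distribution gives a ratio close to 1 and a one-hot distribution gives 0. Near-deterministic routing of the kind reported for most layers falls just above 0.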