
NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

Nandan Kumar Jha, Brandon Reagen

TL;DR

NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices. These signatures generalize beyond transformers to MLP-Mixer architectures and provide actionable insights for architectural and optimizer choices beyond trial and error.

Abstract

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Although FFNs dominate the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales and diverse architectural and optimizer configurations, each uniquely shaping FFN dynamics: normalization schemes control variance flow; FFN weight geometries constrain the latent space; positional encodings and activation functions regulate information flow; and optimizer choices redistribute effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices. These signatures generalize beyond transformers to MLP-Mixer architectures and provide actionable insights for architectural and optimizer choices beyond trial and error.
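
The four metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give formulas, so everything below is an assumption: Spectral Entropy is taken as the Shannon entropy of the normalized eigenvalues, Participation Ratio as the standard $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, and Jensen-Shannon divergence is computed between rank-ordered normalized spectra. The `early_enrichment` function in particular is a hypothetical stand-in (normalized area between the cumulative-variance curve and the uniform baseline), chosen only so that 0 means a flat spectrum and 1 means all variance in one direction; it may differ from the paper's definition of EEE. The function names and the toy `pre`/`post` spectra are illustrative, not taken from the paper.

```python
import math

def normalize(eigs):
    """Turn non-negative eigenvalues into a probability distribution."""
    total = sum(eigs)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    return [e / total for e in eigs]

def spectral_entropy(eigs):
    """Assumed definition: Shannon entropy (nats) of the normalized spectrum."""
    return -sum(p * math.log(p) for p in normalize(eigs) if p > 0)

def participation_ratio(eigs):
    """Standard participation ratio: (sum lambda)^2 / sum lambda^2."""
    return sum(eigs) ** 2 / sum(e * e for e in eigs)

def early_enrichment(eigs):
    """Hypothetical EEE proxy: area between the cumulative-variance curve
    (eigenvalues sorted descending) and the uniform line k/d, rescaled so
    a flat spectrum gives 0 and a one-hot spectrum gives 1."""
    p = sorted(normalize(eigs), reverse=True)
    d = len(p)
    if d < 2:
        return 0.0
    cumulative, gap = 0.0, 0.0
    for k, share in enumerate(p, start=1):
        cumulative += share
        gap += cumulative - k / d
    return gap / ((d - 1) / 2)

def js_divergence(eigs_a, eigs_b):
    """Jensen-Shannon divergence (nats) between two rank-ordered spectra.
    Comparing by rank is an assumption of this sketch."""
    p = sorted(normalize(eigs_a), reverse=True)
    q = sorted(normalize(eigs_b), reverse=True)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy spectra (made up): a top-heavy pre-activation spectrum and a
# flatter post-activation spectrum, mimicking variance reinjection.
pre = [8.0, 1.0, 0.5, 0.5]
post = [3.0, 2.5, 2.5, 2.0]

print("SE  pre/post:", spectral_entropy(pre), spectral_entropy(post))
print("PR  pre/post:", participation_ratio(pre), participation_ratio(post))
print("EEE pre/post:", early_enrichment(pre), early_enrichment(post))
print("JS(pre, post):", js_divergence(pre, post))
```

In practice the eigenvalues would come from the covariance of FFN activations collected before and after the nonlinearity. Under these assumed definitions, variance reinjection shows up as higher entropy and participation ratio, lower enrichment, and a nonzero divergence between the two spectra.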

Paper Structure

This paper contains 45 sections, 6 equations, 33 figures, 13 tables, 1 algorithm.

Figures (33)

  • Figure 1: NerVE quantifies nonlinear eigenspectrum dynamics in FFNs of GPT-2. FFN nonlinearity (GELU) regulates information flow by reinjecting variance, reactivating under-utilized directions (post-activation SE$\uparrow$ and PR$\uparrow$), and flattening the eigenspectrum so that it is less top-heavy (post-activation EEE$\downarrow$). The JS heatmap shows a depth-localized transition band where redistribution is strongest.
  • Figure 2: Cumulative variance distribution across a 768-dimensional latent space. Higher values (shown on curves) indicate top-heavy concentration in a few dominant directions, while lower values reflect a more uniform distribution.
  • Figure 3: Eigenspectrum dynamics illustrate how FFN nonlinearities regulate information flow and reshape the eigenspectrum during training for GPT-2 (ReLU) on CodeParrot. Pre- and post-activation dynamics are shown for SE, PR, and EEE, highlighting how nonlinearities reinject variance and alter spectral structure. JS heatmaps (rightmost) capture the layer-wise distributional shift induced by nonlinearity. In-panel titles report Pearson correlations ($r$) between each metric and evaluation loss.
  • Figure 4: Eigenspectrum dynamics for norm-free GPT-2 (125M) models with GELU (top), ReLU (middle), and learnable-slope Leaky ReLU (bottom). Columns show layer-averaged SE (pre vs. post), PR gain (post to pre), post-activation EEE (yellow regions indicate a top-heavy distribution), and JS (yellow regions highlight strong redistribution) across layers and training steps. Norm-free GELU exhibits spectral inertia in layers 0 to 5 (EEE $\rightarrow$ 1, JS $\rightarrow$ 0), whereas ReLU and Leaky ReLU aggressively reinject variance (PR gain $>$200$\times$) and flatten the spectrum (EEE $<$ 0.3).
  • Figure 5: Impact of (parametric) FFN normalization in norm-free GPT-2 with learnable-slope Leaky ReLU. Eigenspectrum dynamics are quantified by latent capacity (PR_post), spectral regularization and flattening ($\Delta$EEE and EEE_post), and distributional shift (JS). Top to bottom: Weight, Spectral, and Hyperspherical Normalization. Each method exhibits distinct JS localization and spectral patterns, indicating that each influences FFN internal dynamics differently.
  • ...and 28 more figures