Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space?

Nandan Kumar Jha, Brandon Reagen

TL;DR

An asymmetric spectral scaling law is found: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly and with high variance, suggesting that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early.

Abstract

As large language models (LLMs) scale, the question is not only how large they become, but how much of their capacity is effectively utilized. Existing scaling laws relate model size to loss, yet overlook how components exploit their latent space. We study feed-forward networks (FFNs) and recast width selection as a spectral utilization problem. Using a lightweight diagnostic suite -- Hard Rank (participation ratio), Soft Rank (Shannon rank), Spectral Concentration, and the composite Spectral Utilization Index (SUI) -- we quantify how many latent directions are meaningfully activated across LLaMA, GPT-2, and nGPT families. Our key finding is an asymmetric spectral scaling law: soft rank follows an almost perfect power law with FFN width, while hard rank grows only sublinearly and with high variance. This asymmetry suggests that widening FFNs mostly adds low-energy tail directions, while dominant-mode subspaces saturate early. Moreover, at larger widths, variance further collapses into a narrow subspace, leaving much of the latent space under-utilized. These results recast FFN width selection as a principled trade-off between tail capacity and dominant-mode capacity, offering concrete guidance for inference-efficient LLM design.
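The two rank diagnostics named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions: the participation ratio $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$ for hard rank, and the exponential of the Shannon entropy of the normalized eigenvalues for soft rank. The paper's exact normalization may differ, and the function names and the `activations` input (an `(n_tokens, D)` matrix of FFN hidden activations) are hypothetical.

```python
import numpy as np

def ranks_from_eigenvalues(eig):
    """Return (hard_rank, soft_rank) for a covariance eigenvalue spectrum.

    Assumed definitions (standard forms, not confirmed against the paper):
      hard rank = participation ratio = 1 / sum(p_i^2)
      soft rank = Shannon rank        = exp(-sum(p_i * log p_i))
    where p_i = lambda_i / sum(lambda).
    """
    eig = np.asarray(eig, dtype=np.float64)
    eig = eig[eig > 0]                 # drop zero modes so log(p) is defined
    p = eig / eig.sum()                # normalized spectrum, sums to 1
    hard = 1.0 / np.sum(p ** 2)
    soft = np.exp(-np.sum(p * np.log(p)))
    return hard, soft

def spectral_ranks(activations):
    """Compute both ranks from an (n_tokens, D) activation matrix (hypothetical input)."""
    X = np.asarray(activations, dtype=np.float64)
    X = X - X.mean(axis=0, keepdims=True)          # center each feature
    s = np.linalg.svd(X, compute_uv=False)         # singular values of centered data
    eig = s ** 2 / (X.shape[0] - 1)                # covariance eigenvalues
    return ranks_from_eigenvalues(eig)

if __name__ == "__main__":
    # Synthetic power-law spectrum lambda_k ~ k^(-1), purely for illustration.
    k = np.arange(1, 1025, dtype=np.float64)
    hard, soft = ranks_from_eigenvalues(k ** -1.0)
    print(f"hard rank = {hard:.1f}, soft rank = {soft:.1f}")
```

Under these definitions hard rank never exceeds soft rank, because the participation ratio weights large eigenvalues more heavily. This is consistent with the abstract's reading that soft rank tracks the low-energy tail while hard rank tracks the dominant modes.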

Paper Structure

This paper contains 15 sections, 1 equation, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Spectral rank vs. FFN hidden dimension in the LLaMA-130M base model, with width sweep $D = \alpha d$ (total parameters therefore differ across $\alpha$). Log-log fits: soft rank follows a near-linear power law ($\beta$=1.06, $R^2$=0.93), while hard rank grows sublinearly ($\beta$=0.60, $R^2$=0.68), indicating that width mainly adds low-energy tail directions rather than enlarging the high-energy dominant-mode subspace.
  • Figure 2: Asymmetric spectral scaling with FFN width in LLaMA-style Pre-LN models. Soft rank (SRank, red) and hard rank (HRank, blue) vs. FFN hidden dimension $D$ on log-log axes for (a) 70M, (b) 130M, and (c) 250M backbones (fixed $d$, width sweep $D = \alpha d$, $\alpha \in \{1, 2, 2.67, 4, 5, 6, 7, 8\}$). Dashed lines are power-law fits; annotations mark $\alpha d$. Soft-rank exponents cluster near unity ($\beta = \{0.873, 1.069, 0.872\}$; $R^2 = \{0.770, 0.980, 0.850\}$), while hard-rank exponents are smaller and noisier ($\beta = \{0.441, 0.604, 0.407\}$; $R^2 = \{0.248, 0.684, 0.268\}$). All networks are trained from scratch; markers show layer median values, and error bars indicate across-layer variability.
  • Figure 3: Spectral-rank utilization vs. FFN width in LLaMA-style Pre-LN models. We plot soft-rank utilization (SRank$/(D-1)$, red) and hard-rank utilization (HRank$/(D-1)$, blue) vs. FFN hidden dimension $D$ on log-log axes for 70M, 130M, and 250M backbones (fixed depth; width sweep $D = \alpha d, \alpha \in \{1, 2, 2.67, 4, 5, 6, 7, 8\}$). Dashed lines show power-law fits, highlighting that SRank scales nearly linearly with width while HRank grows more slowly and with higher variability. All networks are trained from scratch; markers indicate layer median, and error bars denote across-layer variability.
  • Figure 4: Power-law templates for spectral concentration. Cumulative-variance curves generated from synthetic power-law spectra $\lambda_k \propto k^{-\alpha}$ for three latent sizes $(D=768, 2048, 3072)$. Larger exponents ($\alpha$) front-load variance and push the curve upward. Coloured call-outs report the concentration value reached by benchmark cut-offs.
  • Figure 5: Training-time evolution of spectral scaling laws for LLaMA-130M (Pre-LN). Upper panels (a-d) show raw hard and soft rank, while lower panels (e-h) show normalized ranks (rank utilization). Panels (a,b) and (e,f) track the scaling exponent $\beta$ (blue, left axis) and fit quality $R^2$ (red, right axis), while (c,d) and (g,h) show the corresponding layer-averaged rank dynamics for each FFN width ($D = 1d$ to $8d$).
  • ...and 4 more figures
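
The exponents $\beta$ and fit qualities $R^2$ quoted in the captions above come from power-law fits on log-log axes. A minimal sketch of such a fit is given below, assuming ordinary least squares on $\log(\text{rank})$ vs. $\log D$ with $R^2$ computed in log space; the paper's exact fitting procedure is not stated here. The function name, the base width `d = 768`, and the synthetic rank values are all hypothetical.

```python
import numpy as np

def fit_power_law(widths, ranks):
    """Fit rank ~ c * D**beta by least squares in log-log space.

    Returns (beta, c, r2), where r2 is the coefficient of determination
    of the linear fit in log space (an assumption about how R^2 is defined).
    """
    x = np.log(np.asarray(widths, dtype=np.float64))
    y = np.log(np.asarray(ranks, dtype=np.float64))
    beta, log_c = np.polyfit(x, y, 1)              # slope and intercept
    residuals = y - (beta * x + log_c)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return beta, np.exp(log_c), r2

if __name__ == "__main__":
    d = 768                                         # hypothetical model dimension
    alphas = np.array([1, 2, 2.67, 4, 5, 6, 7, 8])  # width multipliers from the captions
    widths = alphas * d
    # Synthetic, noise-free ranks with a known exponent, purely for illustration.
    ranks = 0.5 * widths ** 0.6
    beta, c, r2 = fit_power_law(widths, ranks)
    print(f"beta = {beta:.3f}, c = {c:.3f}, R^2 = {r2:.3f}")
```

With real per-layer measurements, a value of $\beta$ near 1 would correspond to the near-linear soft-rank scaling described above, and a smaller $\beta$ with low $R^2$ to the sublinear, noisier hard-rank scaling.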