Inverse Depth Scaling From Most Layers Being Similar

Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore

TL;DR

This work investigates how LLMs utilize depth, proposing a three-regime framework (compositional, procedural, ensemble) and a decomposed loss form $L = \frac{c_m}{m^{\alpha_m}} + \frac{c_\ell}{\ell^{\alpha_\ell}} + \frac{c_D}{D^{\alpha_D}} + L_0$. Through hidden-state analyses of real LLMs and controlled toy-model experiments, the authors show that a dominant depth term scales roughly as $L_\ell \propto 1/\ell$, and provide mechanistic evidence that most layers contribute via ensemble averaging rather than compositional depth, with procedural-assembly behavior emerging under tied-teacher dynamics. The key contributions include (i) empirical decomposition of depth-related loss, (ii) identification of inverse-depth scaling linked to ensemble effects, and (iii) toy-model demonstrations distinguishing ensemble averaging from procedural assembly. The findings suggest that to improve LLM efficiency, architectural innovations should encourage true depth compositionality, rather than rely on depth as a collection of similar, redundantly updating layers.
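The decomposed loss form above can be made concrete with a short sketch. It is illustrative only and is not the authors' fitting code. The function names are hypothetical. $\ell$ is depth, as the TL;DR states for the depth term; reading $m$ as width and $D$ as dataset size is an assumption, since this summary does not define them. The exponent fit is a plain log-log least-squares slope, chosen for simplicity.

```python
import math

def decomposed_loss(m, ell, D, c_m, alpha_m, c_ell, alpha_ell, c_D, alpha_D, L0):
    """Evaluate L = c_m/m^a_m + c_ell/ell^a_ell + c_D/D^a_D + L0.

    Assumption: m is width and D is dataset size; ell is depth.
    """
    return c_m / m**alpha_m + c_ell / ell**alpha_ell + c_D / D**alpha_D + L0

def fit_power_law_exponent(xs, ys):
    """Return alpha for y ~ c / x^alpha via least squares in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    var = sum((a - mean_x) ** 2 for a in log_x)
    return -cov / var  # the slope is -alpha for a decaying power law

# Synthetic depth term with alpha_ell = 1 (made-up numbers, not paper data).
depths = [4, 8, 16, 32, 64]
depth_term = [2.0 / ell for ell in depths]
alpha_ell_hat = fit_power_law_exponent(depths, depth_term)
```

On real measurements the depth term would first have to be isolated, for example by subtracting the fitted width, data, and irreducible terms, before a log-log fit like this is meaningful.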

Abstract

Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find loss scales inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.
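A small simulation can illustrate why ensemble averaging would give an inverse-depth error. It is an idealized sketch, not the paper's derivation. It assumes each layer contributes an independent, unbiased, equally noisy estimate of the same target, so the average of $\ell$ estimates has mean squared error $\sigma^2/\ell$. The function name and parameters are hypothetical.

```python
import random

def ensemble_mse(num_layers, noise_std=1.0, trials=4000, seed=0):
    """Monte Carlo estimate of the squared error of an average of noisy estimates.

    Assumption: the target is 0 and every "layer" returns target + independent
    Gaussian noise with standard deviation noise_std.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        average = sum(rng.gauss(0.0, noise_std) for _ in range(num_layers)) / num_layers
        total += average * average
    return total / trials

# Under these assumptions, quadrupling the depth cuts the error by about 4x.
mse_shallow = ensemble_mse(4)
mse_deep = ensemble_mse(16)
ratio = mse_shallow / mse_deep
```

The $1/\ell$ rate here depends on the independence assumption. Correlated layer errors would average out more slowly.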

Paper Structure

This paper contains 18 sections, 26 equations, and 24 figures.

Figures (24)

  • Figure 1: Three regimes of how LLMs may utilize their layers.
  • Figure 2: Most data in most layers are processed in an even and incremental way. (a) Updates of hidden states, measured by the angle between neighboring hidden states $\theta(h_l, h_{l+1})$, show incremental changes in most middle layers. (b) PCA of $\theta(h_l, h_{l+1})$ shows that most tokens are updated evenly in the middle layers, while a small fraction stops updating early; these are usually the first tokens in documents. (c) The mean update decreases with depth. (d) The mean update scales approximately inversely with depth, suggesting more fine-grained updates rather than decomposing higher-level information. (e) Correlation between neighboring updates is small, suggesting non-smooth dynamics. (f) LLM loss roughly follows an inverse depth scaling. More details about the hidden-state analysis are in \ref{app:hid}, and those about loss fitting are in \ref{app:scale}.
  • Figure 3: The toy model can exhibit inverse depth scaling either when the underlying dynamics to be learned are smooth and the target distribution is sharp, or when the underlying dynamics are noisy. (a) Architecture of the toy model. (b) Tied teacher weights ($\rho=1$) produce smooth dynamics and yield $\alpha_\ell=1$ at low teacher temperature (peaked output distributions). Independent teacher weights ($\rho=0$) generate non-smooth dynamics and yield $\alpha_\ell=1$ across different temperatures. Error bars are standard errors. Details in \ref{app:toy}.
  • Figure 4: Matching smooth dynamics leads to inverse depth scaling when the model is not well trained. (a) The depth scaling exponent $\alpha_\ell$ is near $1$ in the early stage of training but increases to $3$ after training. Error bars are standard errors. (b) Low temperature makes training slow, yet all $\alpha_\ell$ tend to increase with more training. The maximum number of training steps is $t_{\rm max} = 80000$. (c) Loss versus training steps for $\ell = 6$ from the experiments in \ref{fig:phenomena} suggests that a low teacher temperature requires longer training and that the corresponding student has not yet converged. Details in \ref{app:toy}.
  • Figure 5: Matching the full transformation as an ensemble may explain the inverse depth scaling in LLMs. (a) Hidden state updates of a student whose teacher has independent MLP weights are even across layers. Each gray line is from one input, and the dark line is the average over the dataset. (b) The mean update in one layer scales inversely with depth. (c) Correlation between neighboring updates is small, suggesting that the dynamics are not smooth. These hidden state features agree with ensemble averaging and are similar to those in LLMs. Details in \ref{app:toy}.
  • ...and 19 more figures
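
The hidden-state diagnostics named in the captions of Figures 2 and 5 can be sketched as follows. This is a minimal illustration and not the authors' analysis code. The function names are hypothetical. The angle $\theta(h_l, h_{l+1})$ is computed as the arccosine of the cosine similarity, and the "correlation between neighboring updates" is taken to be the cosine similarity between consecutive updates $h_{l+1} - h_l$, which is an assumption about the paper's exact definition. All vectors are assumed to be nonzero.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Assumes both vectors are nonzero.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def neighbor_angles(hidden_states):
    """Angle in radians between each pair of neighboring hidden states."""
    return [
        math.acos(max(-1.0, min(1.0, cosine(h, h_next))))  # clamp for rounding
        for h, h_next in zip(hidden_states, hidden_states[1:])
    ]

def neighbor_update_cosines(hidden_states):
    """Cosine similarity between consecutive updates h_{l+1} - h_l."""
    updates = [
        [b - a for a, b in zip(h, h_next)]
        for h, h_next in zip(hidden_states, hidden_states[1:])
    ]
    return [cosine(u, u_next) for u, u_next in zip(updates, updates[1:])]

# Toy trajectory of one token through three layers (made-up vectors).
trajectory = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
angles = neighbor_angles(trajectory)
update_cosines = neighbor_update_cosines(trajectory)
```

Update cosines near zero across layers correspond to the non-smooth, ensemble-like picture described in the captions, while values near one would indicate smooth dynamics in which each update continues the previous one.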