Inverse Depth Scaling From Most Layers Being Similar
Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore
TL;DR
This work investigates how LLMs use depth. It proposes a three-regime framework (compositional, procedural, ensemble) and a decomposed loss of the form $L = \frac{c_m}{m^{\alpha_m}} + \frac{c_\ell}{\ell^{\alpha_\ell}} + \frac{c_D}{D^{\alpha_D}} + L_0$. Through hidden-state analyses of real LLMs and controlled toy-model experiments, the authors show that the depth-dependent loss term scales roughly as $L_\ell \propto 1/\ell$, and they provide mechanistic evidence that most layers contribute through ensemble averaging rather than compositional use of depth, with procedural-assembly behavior emerging under tied-teacher dynamics. The key contributions are (i) an empirical decomposition of depth-related loss, (ii) the identification of inverse-depth scaling linked to ensemble effects, and (iii) toy-model demonstrations that distinguish ensemble averaging from procedural assembly. The findings suggest that, to improve LLM efficiency, architectural innovations should encourage compositional use of depth rather than rely on depth as a stack of similar, redundantly updating layers.
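To make the decomposed loss concrete, the sketch below evaluates it for given values of $m$, $\ell$, and $D$. The summary above does not define $m$ or $D$, so the code treats them as generic positive scale variables. The function names, all coefficients, and all exponents are hypothetical placeholders for illustration, not fitted values from the paper.

```python
def depth_term(depth, c_l=1.0, alpha_l=1.0):
    """Depth-dependent loss term c_l / depth**alpha_l.

    alpha_l = 1.0 corresponds to the inverse-depth scaling L_l ~ 1/depth.
    c_l is a hypothetical placeholder coefficient.
    """
    return c_l / depth ** alpha_l


def decomposed_loss(m, depth, D,
                    c_m=1.0, alpha_m=1.0,
                    c_l=1.0, alpha_l=1.0,
                    c_D=1.0, alpha_D=1.0,
                    L0=0.0):
    """Evaluate L = c_m/m^a_m + c_l/depth^a_l + c_D/D^a_D + L0.

    All default coefficients and exponents are illustrative assumptions.
    """
    return (c_m / m ** alpha_m
            + depth_term(depth, c_l, alpha_l)
            + c_D / D ** alpha_D
            + L0)


if __name__ == "__main__":
    # With alpha_l = 1, doubling the depth halves the depth term.
    for depth in (8, 16, 32):
        print(depth, depth_term(depth))
```

With $\alpha_\ell = 1$, doubling $\ell$ halves only the depth term; the other terms and $L_0$ are unchanged, which is why the total loss improves more slowly than the depth term alone.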
Abstract
Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, which calls for more detailed study. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find that loss scales inversely with depth in LLMs, probably because functionally similar layers reduce error through ensemble averaging rather than through compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust, and it may arise from the architectural bias of residual networks and from target functions that are incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations that encourage compositional use of depth.
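The link between ensemble averaging and inverse-depth scaling can be illustrated with a simple statistical sketch. It is not the toy residual network studied in the paper. It assumes that each of $\ell$ "layers" produces an independent, unbiased, equally noisy estimate of the same target, in which case the mean squared error of their average is $\sigma^2/\ell$. The function name `ensemble_mse` and all parameter values are hypothetical.

```python
import random


def ensemble_mse(num_layers, noise_std=1.0, trials=20000, seed=0):
    """Monte Carlo estimate of the MSE of an average of noisy estimates.

    Assumption (not from the paper): each layer contributes an independent
    Gaussian error with standard deviation noise_std around the same target.
    """
    rng = random.Random(seed)
    target = 0.0
    total_sq_err = 0.0
    for _ in range(trials):
        # Average num_layers independent noisy estimates of the target.
        estimate = sum(target + rng.gauss(0.0, noise_std)
                       for _ in range(num_layers)) / num_layers
        total_sq_err += (estimate - target) ** 2
    return total_sq_err / trials


if __name__ == "__main__":
    for num_layers in (1, 4, 16):
        mse = ensemble_mse(num_layers)
        # mse * num_layers should stay close to noise_std**2 = 1.
        print(num_layers, mse, mse * num_layers)
```

Under the independence assumption the product of MSE and layer count stays roughly constant, which is the $1/\ell$ behavior. Correlated errors between layers would weaken this reduction, so the sketch shows the idealized case only.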
