Do Language Models Use Their Depth Efficiently?
Róbert Csordás, Christopher D. Manning, Christopher Potts
TL;DR
The paper investigates whether deeper Transformer-based LLMs actually perform more complex computations or simply spread the same computations across more layers. Using causal interventions, residual analysis, and cross-model mappings across Llama 3.1, Qwen 3, and OLMo 2, the authors identify a mid-network phase transition where later layers contribute less to downstream computation and primarily refine the current token distribution. They show that deeper models do not systematically deploy new computations for harder tasks, and that linear mappings between shallow and deep models reveal a diagonal, “spread out” relationship rather than novel depth-enabled operations. An exploratory look at MoEUT suggests potential for more efficient use of depth, but overall the findings challenge the assumption that increased depth yields proportionally deeper computation in current architectures. The work highlights architectural and training considerations needed to harness depth effectively and questions the practicality of latent-thinking approaches that rely on deeper internal computations.
Abstract
Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 families of models. We find the following. First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
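Two of the measurements named in the abstract can be illustrated with a minimal NumPy sketch: the size of a sublayer's output relative to the residual stream, and a linear map fitted from one model's residual stream to another's. This is not the authors' code. The function names, the use of an L2-norm ratio, the closed-form least-squares fit (in place of whatever training procedure the paper uses), and the in-sample relative error are all assumptions made for illustration. Activations are assumed to be already extracted as arrays of shape `(tokens, hidden_dim)`.

```python
import numpy as np


def relative_contribution(residual, sublayer_out):
    """Mean over tokens of ||sublayer output|| / ||residual input||.

    Assumption: the L2 norm ratio is used as the "contribution" measure.
    Both arrays have shape (tokens, hidden_dim).
    """
    num = np.linalg.norm(sublayer_out, axis=-1)
    den = np.linalg.norm(residual, axis=-1)
    return float(np.mean(num / den))


def fit_linear_map(src, dst):
    """Fit W minimizing ||src @ W - dst||_F by ordinary least squares.

    src: (tokens, d_src) residuals from the shallow model (hypothetical input).
    dst: (tokens, d_dst) residuals from the deeper model on the same tokens.
    Returns W and the in-sample relative error of the fit.
    """
    W, *_ = np.linalg.lstsq(src, dst, rcond=None)
    err = np.linalg.norm(src @ W - dst) / np.linalg.norm(dst)
    return W, float(err)


def mapping_error_matrix(shallow_layers, deep_layers):
    """errors[i, j] = relative error of mapping shallow layer i to deep layer j.

    Each argument is a list of (tokens, hidden_dim) arrays, one per layer.
    """
    errors = np.zeros((len(shallow_layers), len(deep_layers)))
    for i, src in enumerate(shallow_layers):
        for j, dst in enumerate(deep_layers):
            errors[i, j] = fit_linear_map(src, dst)[1]
    return errors
```

Under these assumptions, a low-error band along the diagonal of the matrix from `mapping_error_matrix` (after rescaling both axes to relative depth) would correspond to the abstract's finding that layers at the same relative depth map best to each other. A careful version would measure the error on held-out tokens rather than in-sample.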
