Do Language Models Use Their Depth Efficiently?

Róbert Csordás, Christopher D. Manning, Christopher Potts

TL;DR

The paper investigates whether deeper Transformer-based LLMs actually perform more complex computations or simply spread the same computations across more layers. Using causal interventions, residual analysis, and cross-model mappings across Llama 3.1, Qwen 3, and OLMo 2, the authors identify a mid-network phase transition where later layers contribute less to downstream computation and primarily refine the current token distribution. They show that deeper models do not systematically deploy new computations for harder tasks, and that linear mappings between shallow and deep models reveal a diagonal, “spread out” relationship rather than novel depth-enabled operations. An exploratory look at MoEUT suggests potential for more efficient use of depth, but overall the findings challenge the assumption that increased depth yields proportionally deeper computation in current architectures. The work highlights architectural and training considerations needed to harness depth effectively and questions the practicality of latent-thinking approaches that rely on deeper internal computations.

Abstract

Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 families of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
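
The first finding in the abstract compares each sublayer's output with the residual stream it writes into. A minimal sketch of the two quantities involved (relative norm and cosine similarity) might look like the following. The function names and the toy vectors are illustrative assumptions, not the authors' code or data, and the paper's exact normalization and averaging over tokens may differ.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean (L2) norm."""
    return math.sqrt(dot(u, u))

def relative_contribution(residual, update):
    """Size of a sublayer's write relative to the residual it is added to."""
    return norm(update) / norm(residual)

def cosine_similarity(residual, update):
    """Direction of the write relative to the residual (-1 to 1)."""
    return dot(residual, update) / (norm(residual) * norm(update))

# Toy, made-up vectors (assumption): one residual state and two candidate updates.
residual = [3.0, 4.0]
large_update = [4.0, -3.0]   # same norm as the residual, orthogonal to it
small_update = [0.3, 0.4]    # a tenth of the norm, aligned with the residual

for name, update in [("large", large_update), ("small", small_update)]:
    print(name,
          round(relative_contribution(residual, update), 3),
          round(cosine_similarity(residual, update), 3))
```

In a real model, `residual` would be the hidden state entering a sublayer and `update` the sublayer's output before it is added back, collected per layer and per token.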

Paper Structure

This paper contains 36 sections, 1 equation, 43 figures.

Figures (43)

  • Figure 1: Performance of 132 base models from the Open LLM Leaderboard [open-llm-leaderboard-v2] as a function of depth. Colors represent different model families; dot size is proportional to parameter count. Linear regression in red, with 95% confidence interval. Depth is a significant predictor even in regressions that control for other scale-relevant factors (App. \ref{app:depth-regression}). Deeper models generally perform better.
  • Figure 2: Influence of layers and sublayers on the residual stream for Llama 3.1 70B. (a) Norm of contributions relative to the residual stream. A sharp drop is visible near the middle; later layers change the residual much less, with the exception of the last few layers. (b) Cosine similarity between the contributions and the corresponding residual shows a phase change at the middle of the network.
  • Figure 3: The maximum relative change in a layer's contribution when a previous layer is skipped, for Llama 3.1 70B on GSM8K [cobbe2021training]. (a) shows the maximum effect on future computations over all tokens in the sequence, including the current token; (b) isolates the maximum effect on future tokens only. The range is limited to between 0 and 1. In (a), the second half of the layers has a weaker effect on future computations than the first half. Given this low influence on future layers, combined with their high importance for prediction (Fig. \ref{fig:current_max_prob_change} in the Appendix), the second-half layers seem to perform mostly independent, but important, computations that refine the current predicted probability distribution. This is supported by the findings of Fig. \ref{fig:kl_div}. Panel (b) shows that the second half has little effect on future tokens, indicating that these layers are not computing reusable subresults.
  • Figure 4: Comparing logit-lens predictions at different layers to the final prediction. (a) KL-divergence. (b) Overlap in the top-5 predicted tokens. Both show that later layers are devoted primarily to refining the output probability distribution, rather than to performing new kinds of computation.
  • Figure 5: Analyzing the direct local effects between pairs of layers of Llama 3.1 70B on GSM8K [cobbe2021training]. The heatmaps highlight layer pairs with direct effects on each other. Unlike Fig. \ref{fig:layer_and_logit_effects}, the effects are not propagated to future layers. For each layer $s$, the plot shows future layers that build on the representation computed by $s$. (a) Effects on all tokens, highlighting all possible circuits. (b) Effect on future tokens. The sparse, bright spots indicate multi-layer, multi-token mechanisms, such as induction heads. Note that interacting layers are not necessarily spatially close to each other.
  • ...and 38 more figures
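
The two comparisons named in the Figure 4 caption, KL-divergence and top-5 overlap between an intermediate-layer prediction and the final prediction, can be sketched as below. This is a hedged illustration under stated assumptions: the logits are made-up toy values, the helper names are hypothetical, and the direction of the KL-divergence (final distribution first) is an assumption rather than something the caption specifies.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; q is floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def top_k_overlap(logits_a, logits_b, k=5):
    """Fraction of the top-k token ids shared by two logit vectors."""
    def top(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
        return set(order[:k])
    return len(top(logits_a) & top(logits_b)) / k

# Toy logits over an 8-token vocabulary (assumption, not data from the paper).
final_logits = [5.0, 4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0]
mid_logits = [0.0, 5.0, 4.0, 3.0, 2.0, 1.0, -1.0, -2.0]

kl = kl_divergence(softmax(final_logits), softmax(mid_logits))
overlap = top_k_overlap(final_logits, mid_logits, k=5)
print(f"KL(final || mid) = {kl:.3f}, top-5 overlap = {overlap:.1f}")
```

In an actual logit-lens setup, `mid_logits` would come from applying the model's final normalization and unembedding to an intermediate residual state; here both vectors are simply given.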