Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer

TL;DR

The paper investigates whether gradual depth growth with MIDAS and the proposed LIDAS can overcome the Curse of Depth in Transformers. It formalizes a depth-growth operator, introduces a middle-layer duplication strategy, and empirically shows improved reasoning performance and training efficiency over conventional training and LayerNorm-Scaling on SmolLM backbones. Depth analyses reveal that grown models utilize depth more effectively, form permutable mid-network blocks, and exhibit cyclical intra-block dynamics that differ from non-grown baselines. The lightweight LIDAS variant achieves symmetric weight structures and stronger central-layer engagement, offering robust gains in reasoning benchmarks while maintaining language modeling performance. Collectively, the work provides a mechanistic account of how gradual depth growth reshapes computation to enable deeper, more efficient reasoning circuits.

Abstract

Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.

Paper Structure

This paper contains 43 sections, 4 equations, 35 figures, 9 tables.

Figures (35)

  • Figure 1: Depth-grown models use their depth more (1.7B). (A) Depth score (Csordás et al., 2025) on MATH (Hendrycks et al., 2021) and MQuAKE (Zhong et al., 2023). Grown models (MIDAS, LIDAS) have consistently higher depth scores. (B) Top-5 overlap between each layer’s early-exit vocabulary and the model’s final vocabulary on 20 prompts from GSM8K (Cobbe et al., 2021). Both grown models studied in this work exhibit lower overlap at later layers, indicating that these later layers still contribute additional features necessary for the final prediction. (C) Early-exit relative accuracy versus layer on the Variable Assignment Math reasoning primitive. The baseline reaches near its final performance early, whereas accuracy for MIDAS and LIDAS continues to rise up to the last layer. On these metrics, however, LN-Scaling shows no discernible benefit over the baseline in depth utilization.
  • Figure 1: Performance comparison of a standard transformer baseline, LayerNorm-Scaling, and the two grown models MIDAS and LIDAS. We reproduce the findings of Saunshi et al. (2024) and observe that grown models match the baseline on the training objective (NLL), standard Q&A benchmarks, and Lambada. Grown models, especially LIDAS, outperform the non-grown baseline on reasoning-heavy tasks such as Math Word and Primitives. LN-Scaling, on the other hand, achieves only minor improvements, which diminish when scaling to the larger model.
  • Figure 2: Illustration of growing strategies with block size 4: MIDAS vs. LIDAS, with an even number of existing blocks. MIDAS (Saunshi et al., 2024) simply copies $B^{\prime}=B_m$, which is the block preceding mid-depth. When seen from a block-wise perspective instead of a layer-wise perspective, our proposed variant LIDAS may be interpreted as forming $B^{\prime}$ from the two blocks surrounding the mid-depth by combining the first two layers of $B_{m+1}$ with the last two layers of $B_m$. This small difference in initialization leads to significantly improved performance, as shown in the results table.
  • Figure 3: Effect of swapping blocks of layers on Lambada (top row) and the reasoning primitive Variable Assignment Math (bottom row). MIDAS is more robust to interventions for larger blocks in the middle of the network: the degradation in performance for MIDAS is much smaller for swapping blocks of larger sizes $\{2, 4, 8\}$ compared to the baseline, especially for Lambada. Results including LIDAS are presented in the appendix.
  • Figure 3: Open-book QA Benchmarks
  • ...and 30 more figures
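
The difference between the two growing strategies described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: layers are represented as plain string labels rather than real Transformer layers, and the function names (`new_block_midas`, `new_block_lidas`, `grow`) are hypothetical. Two further assumptions are not stated in the caption: the new block $B^{\prime}$ is inserted at mid-depth (between $B_m$ and $B_{m+1}$), and in LIDAS the layers of $B^{\prime}$ keep their depth order (last half of $B_m$, then first half of $B_{m+1}$).

```python
from typing import Callable, List, Sequence


def _check(layers: Sequence[str], block_size: int) -> int:
    """Validate the setting from the caption and return the mid-depth index."""
    if block_size <= 0 or block_size % 2 != 0:
        raise ValueError("block_size must be a positive even number")
    if len(layers) % block_size != 0:
        raise ValueError("number of layers must be a multiple of block_size")
    if (len(layers) // block_size) % 2 != 0:
        raise ValueError("sketch covers only an even number of existing blocks")
    return len(layers) // 2


def new_block_midas(layers: Sequence[str], block_size: int) -> List[str]:
    """MIDAS-style: B' is a copy of B_m, the block preceding mid-depth."""
    mid = _check(layers, block_size)
    return list(layers[mid - block_size:mid])


def new_block_lidas(layers: Sequence[str], block_size: int) -> List[str]:
    """LIDAS-style: B' is the last half of B_m plus the first half of B_{m+1}.

    Assumption: the copied layers keep their original depth order.
    """
    mid = _check(layers, block_size)
    half = block_size // 2
    return list(layers[mid - half:mid + half])


def grow(
    layers: Sequence[str],
    block_size: int,
    new_block_fn: Callable[[Sequence[str], int], List[str]],
) -> List[str]:
    """Insert the new block at mid-depth (assumed insertion point)."""
    mid = _check(layers, block_size)
    return list(layers[:mid]) + new_block_fn(layers, block_size) + list(layers[mid:])


# Two existing blocks of size 4: B_m = a0..a3, B_{m+1} = b0..b3.
LAYERS = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"]
MIDAS_GROWN = grow(LAYERS, 4, new_block_midas)
LIDAS_GROWN = grow(LAYERS, 4, new_block_lidas)
```

With the toy labels above, the MIDAS-style block is `a0..a3`, whereas the LIDAS-style block is `a2, a3, b0, b1`, i.e. the four layers centered on mid-depth. In a real model, each label would stand for a layer whose weights are copied at growth time.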