Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis
Ferdinand Kapl, Emmanouil Angelis, Tobias Höppe, Kaitlin Maile, Johannes von Oswald, Nino Scherrer, Stefan Bauer
TL;DR
The paper investigates whether gradual depth growth with MIDAS and the proposed LIDAS can overcome the Curse of Depth in Transformers. It formalizes a depth-growth operator, introduces a middle-layer duplication strategy, and empirically shows improved reasoning performance and training efficiency over conventional training and LayerNorm-Scaling on SmolLM backbones. Depth analyses reveal that grown models utilize depth more effectively, form permutable mid-network blocks, and exhibit cyclical intra-block dynamics that differ from non-grown baselines. The lightweight LIDAS variant achieves symmetric weight structures and stronger central-layer engagement, offering robust gains in reasoning benchmarks while maintaining language modeling performance. Collectively, the work provides a mechanistic account of how gradual depth growth reshapes computation to enable deeper, more efficient reasoning circuits.
Abstract
Gradually growing the depth of Transformers during training can not only reduce training cost but also lead to improved reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. In this work, we establish a connection to recent work showing that layers in the second half of non-grown, pre-layernorm Transformers contribute much less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the residual stream structure, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements in downstream reasoning benchmarks. Overall, this work highlights how the gradual growth of model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
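
To make "growth via gradual middle stacking" concrete, the following is a minimal sketch of one growth step on a plain Python list of layers. It is not the authors' implementation: the function name `grow_middle`, the `block_size` parameter, and the rule used to pick the "middle" block are assumptions made for illustration only, and the sketch does not cover the growth schedule or the specific LIDAS modification.

```python
from copy import deepcopy
from typing import List, TypeVar

T = TypeVar("T")


def grow_middle(layers: List[T], block_size: int) -> List[T]:
    """Return a deeper stack with a copy of the middle block inserted.

    Hypothetical helper: the middle block is taken to be the `block_size`
    consecutive layers centred in the stack (rounding towards the start),
    and the copy is placed directly after the original block.
    """
    n = len(layers)
    if not 0 < block_size <= n:
        raise ValueError("block_size must be between 1 and len(layers)")

    start = (n - block_size) // 2
    end = start + block_size

    # Deep-copy so the new layers start from the same weights as the
    # originals but can be trained independently afterwards.
    middle_copy = [deepcopy(layer) for layer in layers[start:end]]

    return layers[:end] + middle_copy + layers[end:]


if __name__ == "__main__":
    stack = ["L0", "L1", "L2", "L3"]
    print(grow_middle(stack, block_size=2))
```

Applying such a step repeatedly during training would grow the model from a shallow stack to its final depth; when and how often to grow is a separate design choice that this sketch leaves open.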
