From Growing to Looping: A Unified View of Iterative Computation in LLMs

Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer

TL;DR

The paper investigates looping and depth growth as two approaches for embedding iterative computation in LLMs to enhance reasoning beyond simple parameter scaling. By unifying these approaches through mechanistic depth-wise diagnostics and residual-stream analyses, it demonstrates convergent signatures such as increased late-layer usage and block-aligned sublayer patterns, supporting a common iterative computation mechanism. It shows these methods are adaptable and composable: inference-time looping can boost depth-grown models by up to $2\times$ on some reasoning primitives, and higher-quality cooldown data further amplifies gains, especially for depth-grown systems. The practical takeaway is that growing first and looping later yields a robust, scalable recipe for improving reasoning, with retrofitted recurrence extending these benefits under realistic budgets.

Abstract

Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to $2\times$, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.
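
To make the two definitions in the abstract concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation and does not reproduce MIDAS or LIDAS; the residual layer `apply_layer`, the helper names `forward` and `grow_middle`, and the sizes (12 layers, a middle block spanning layers 4 to 7) are all assumptions made for illustration. Looping reuses one block with shared weights, while depth growth inserts a copy of the block as new, separately trainable layers.

```python
import numpy as np


def apply_layer(x, w):
    # Toy residual layer (an assumption, standing in for a transformer layer).
    return x + 0.1 * np.tanh(x @ w)


def forward(weights, x, loop_start=0, loop_end=0, n_loops=1):
    # Run the layers before the block once, the block [loop_start, loop_end)
    # n_loops times with shared weights, then the remaining layers once.
    # With the default arguments this is a plain forward pass.
    for w in weights[:loop_start]:
        x = apply_layer(x, w)
    for _ in range(n_loops):
        for w in weights[loop_start:loop_end]:
            x = apply_layer(x, w)
    for w in weights[loop_end:]:
        x = apply_layer(x, w)
    return x


def grow_middle(weights, start, end):
    # Toy depth growth: insert a copy of the middle block right after itself.
    # The copies are separate arrays, so further training could update them
    # independently of the originals.
    copies = [w.copy() for w in weights[start:end]]
    return weights[:end] + copies + weights[end:]


rng = np.random.default_rng(0)
dim, n_layers = 8, 12  # hypothetical sizes
weights = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(n_layers)]
x = rng.normal(size=(2, dim))

# Inference-time looping: repeat layers 4..7 twice without changing weights.
looped_out = forward(weights, x, loop_start=4, loop_end=8, n_loops=2)

# Depth growth: a 16-layer model whose new layers start as copies of 4..7.
grown = grow_middle(weights, 4, 8)
grown_out = forward(grown, x)

# In this toy setting the two computations coincide right after duplication.
same_at_init = np.allclose(looped_out, grown_out)
```

In this sketch, a freshly grown model computes the same function as the original model with its middle block looped twice; the two only diverge once the copied layers are trained separately. That is one simple way to see why the two techniques might share an inductive bias, though the paper's evidence rests on its depth-wise diagnostics rather than on this identity.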

Paper Structure (30 sections, 1 equation, 21 figures, 3 tables)

Figures (21)

  • Figure 1: Trade-offs for looped and depth-grown models. Each point corresponds to a model in the main results table (up to 1.7B parameters), plotted by average Reasoning Primitives accuracy versus unique parameters (left), inference FLOPs (middle), and training FLOPs (right). Looped and depth-grown models improve accuracy in reasoning primitives over standard baselines, suggesting a shared inductive bias toward better reasoning. Looped models improve reasoning under fixed parameter budgets and can be competitive under fixed inference budgets, while depth-grown models reach similar or better reasoning with less training compute. A complete summary figure in the appendix shows additional benchmark categories.
  • Figure 1: Performance comparison of standard transformer baselines, looped models, and two depth-grown models, MIDAS (Saunshi et al., 2024) and LIDAS (Kapl et al., 2025), at 360M and 1.7B base model sizes. Looped models often outperform iso-param baselines and are competitive with iso-inference baselines, especially for reasoning-heavy task categories such as Open-book Q&A, Math Word Problems and Reasoning Primitives. Depth-grown models match the baselines across most task categories with roughly $80\%$ of the pre-training compute, while outperforming them on reasoning. This suggests a shared inductive bias toward reasoning for looped and depth-grown models. Best performance per model size is in bold and looped model rows are in gray.
  • Figure 2: Looped and depth-grown models use later layers more. We compare Baseline, LIDAS, $\mathrm{Loop}\,(4 \times 6)$ and $\mathrm{Loop}\,(4\text{-}4 \times 4\text{-}4)$ on (A) depth score, (B) top-5 vocabulary overlap on GSM8K and (C) Tuned Lens early-exit normalized accuracy on the Variable Assignment Math reasoning primitive. All three diagnostics imply higher usage of later layers for the grown LIDAS and looped models.
  • Figure 2: Nemotron-CC-Math-4+ leads to the highest reasoning gains. We ablate the effect of increasing the proportion and quality of math tokens during the cooldown (relative to 6% OpenWebMath, shown in gray) on reasoning performance. Both FineMath-4+ (FMT) and Nemotron-CC-Math-4+ (NMT) increase performance on Math Word Problems, Reasoning Primitives and GSM8K.
  • Figure 3: Looped and depth-grown models exhibit similar (sub)layer usage. LIDAS, $\mathrm{Loop}\,(4 \times 6)$ and $\mathrm{Loop}\,(4\text{-}4 \times 4\text{-}4)$ share a slower residual norm growth than the baseline and exhibit periodic attention-sublayer contributions (ratio of the attention sublayer output norm to the residual norm) with a 4-layer cycle, matching the block size of LIDAS and the size of the recurrent block.
  • ...and 16 more figures