Two-Scale Latent Dynamics for Recurrent-Depth Transformers

Francesco Pappone, Donato Crisostomi, Emanuele Rodolà

TL;DR

This work analyzes test-time compute in recurrent-depth transformers and uncovers a two-scale geometry in the latent dynamics: small-scale, curved refinements inside looped blocks and a slower drift across blocks. By tracking iterate-level diagnostics such as step norms $\|\Delta^{(k)}\|_2$ and consecutive-step angles $\cos\angle(\Delta^{(k)},\Delta^{(k-1)})$, it demonstrates that loop updates rapidly shrink and rotate away from previous directions, while cross-block updates accumulate more gradually. The authors introduce a decoding-free, second-order exit rule based on the acceleration $a^{(k)} = \|\Delta^{(k)} - \Delta^{(k-1)}\|_2$ with a two-hit check, which yields better latency-quality trade-offs than KL-based or first-order step-norm exits. Across three recurrent regions in a GPT-2-style model, this geometry-guided controller achieves lower latency without sacrificing performance, offering a practical approach for adaptive latent-depth compute in language models.
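The two diagnostics named above can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes that $\Delta^{(k)}$ is the difference between consecutive loop iterates, and the function name `step_diagnostics` and the plain-list representation of states are hypothetical choices made for the example.

```python
import math


def step_diagnostics(states):
    """Return (step norms, consecutive-step cosines) for a list of iterates.

    Assumption (not stated in the summary): the k-th step is
    delta_k = states[k] - states[k - 1], with each state a list of floats.
    """
    # First-order differences between consecutive iterates.
    deltas = [[b - a for a, b in zip(s0, s1)]
              for s0, s1 in zip(states, states[1:])]
    # L2 norm of every step.
    norms = [math.sqrt(sum(x * x for x in d)) for d in deltas]
    # Cosine of the angle between each step and the one before it.
    cosines = []
    for prev, cur, n_prev, n_cur in zip(deltas, deltas[1:], norms, norms[1:]):
        dot = sum(p * c for p, c in zip(prev, cur))
        denom = n_prev * n_cur
        # Guard against zero-length steps; 0.0 is an arbitrary convention here.
        cosines.append(dot / denom if denom > 0 else 0.0)
    return norms, cosines


# Toy trajectory: each step halves in length and turns by 90 degrees.
toy_states = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.5], [0.75, 0.5]]
toy_norms, toy_cosines = step_diagnostics(toy_states)
```

On the toy trajectory the norms shrink while the cosines stay at zero, which is the qualitative pattern the summary describes for loop updates (shrinking steps that rotate away from the previous direction).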

Abstract

Recurrent-depth transformers scale test-time compute by iterating latent computations before emitting tokens. We study the geometry of these iterates and argue for a simple, two-scale operational picture: (i) within a looped block, updates act as small-scale refinements; (ii) across consecutive blocks, states undergo a larger-scale drift. Across training, our measurements show that loop steps become smaller and increasingly orthogonal to one another, indicating better local modeling of fine structure rather than merely pushing in a single direction. These dynamics motivate an early-exit mechanism based on the model's second-order difference in step size, which we show is superior in performance, stability, and time efficiency to both the KL-divergence exit strategy of Geiping et al. and its naive first-order counterpart.
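A second-order exit of this kind might look like the sketch below. It is written under stated assumptions and is not the paper's algorithm: the "two-hit check" is read here as requiring the acceleration $\|\Delta^{(k)} - \Delta^{(k-1)}\|_2$ to fall below a threshold `tau` on two consecutive steps, and `step_fn`, `tau`, `max_steps`, and `hits_needed` are hypothetical names introduced for the example.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def run_with_acceleration_exit(step_fn, state, tau, max_steps=30, hits_needed=2):
    """Iterate step_fn and stop early once the acceleration stays small.

    Assumptions (not confirmed by the summary): a "hit" is an acceleration
    below tau, and hits must be consecutive. Returns (final state, steps used).
    """
    prev_delta = None
    hits = 0
    for k in range(1, max_steps + 1):
        new_state = step_fn(state)
        delta = [b - a for a, b in zip(state, new_state)]
        state = new_state
        if prev_delta is not None:
            # Second-order difference: change between consecutive steps.
            accel = l2_norm([d - p for d, p in zip(delta, prev_delta)])
            hits = hits + 1 if accel < tau else 0
            if hits >= hits_needed:
                return state, k
        prev_delta = delta
    return state, max_steps


# Toy loop body: move halfway toward a fixed target on every iteration.
def toy_step(s, target=(1.0,)):
    return [x + 0.5 * (t - x) for x, t in zip(s, target)]


final_state, steps_used = run_with_acceleration_exit(toy_step, [0.0], tau=0.1)
```

The rule needs only the latent states themselves, so no decoding to token probabilities is involved, in contrast to a KL-based exit. Resetting `hits` after a miss is one possible design choice; it makes a single noisy low-acceleration step insufficient to trigger the exit.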

Paper Structure

This paper contains 24 sections, 2 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: Overall trajectory with inset zoom (2D PCA). Tight loop refinements (inset) vs. larger cross-block moves (main view).
  • Figure 2: Cross-block drift (DLR). DLR at boundaries $(4\!\to\!5\text{--}6)$ and $(5\text{--}6\!\to\!7)$ across checkpoints. Values $>1$ indicate larger-scale drift across blocks.
  • Figure 3: Loop dynamics. Rows show (a) step norms and (b) step angles. Within each row, panels (i)–(iii) correspond to groups 4, 5–6, and 7, respectively.
  • Figure 4: Exit policies: step-norm (blue), KL (orange), acceleration (green). Acceleration preserves quality while enabling more aggressive thresholds and lower latency; KL preserves quality but is slower; step-norm is fast but loses quality at high $\tau$.
  • Figure 5: Consecutive-step cosine $\cos\angle(\Delta^{(k)},\Delta^{(k-1)})$ over a fixed 30 loops, across checkpoints for groups 4, 5–6, and 7.