Two-Scale Latent Dynamics for Recurrent-Depth Transformers
Francesco Pappone, Donato Crisostomi, Emanuele Rodolà
TL;DR
This work analyzes test-time compute in recurrent-depth transformers and uncovers a two-scale geometry in their latent dynamics: small-scale, curved refinements within looped blocks and a slower drift across blocks. It tracks iterate-level diagnostics such as step norms $\lVert \Delta^{(k)} \rVert_2$ and consecutive-step angles $\cos\angle(\Delta^{(k)}, \Delta^{(k-1)})$, where $\Delta^{(k)}$ denotes the latent update at loop step $k$. These diagnostics show that loop updates rapidly shrink and rotate away from previous directions, while cross-block updates accumulate more gradually. The authors introduce a decoding-free, second-order exit rule based on the acceleration $a^{(k)} = \lVert \Delta^{(k)} - \Delta^{(k-1)} \rVert_2$ with a two-hit check, which yields better latency-quality trade-offs than KL-based or first-order step-norm exits. Across three recurrent regions in a GPT-2–style model, this geometry-guided controller achieves lower latency without sacrificing performance, offering a practical approach to adaptive latent-depth compute in language models.
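As an illustration of the acceleration-based exit rule, the following is a minimal sketch and not the authors' implementation. It assumes that the "two-hit check" means the acceleration must fall below a threshold on two consecutive loop steps. The names `acceleration_exit`, `step_fn`, `tau`, `max_steps`, and `hits_needed` are hypothetical, and `step_fn` stands in for one pass through a looped block.

```python
import numpy as np

def acceleration_exit(step_fn, h0, tau=1e-2, max_steps=64, hits_needed=2):
    """Iterate h <- step_fn(h) and exit once the acceleration
    a_k = ||delta_k - delta_{k-1}||_2 is below tau on `hits_needed`
    consecutive steps (assumed reading of the two-hit check).

    Returns the final state and the number of loop steps taken.
    """
    h = np.asarray(h0, dtype=float)
    prev_delta = None  # delta_{k-1}; undefined before the first step
    hits = 0           # consecutive steps with acceleration below tau
    for k in range(1, max_steps + 1):
        h_next = step_fn(h)
        delta = h_next - h  # delta_k: update at loop step k
        h = h_next
        if prev_delta is not None:
            accel = np.linalg.norm(delta - prev_delta)
            # Reset the counter whenever the acceleration rises above tau.
            hits = hits + 1 if accel < tau else 0
            if hits >= hits_needed:
                return h, k
        prev_delta = delta
    return h, max_steps

# Toy usage: a contraction stands in for a looped block (hypothetical).
state, steps = acceleration_exit(lambda h: 0.5 * h, np.array([1.0, 0.0]))
```

The rule needs no decoding step: it reads only latent states, whereas a KL-based exit has to compare output distributions.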
Abstract
Recurrent-depth transformers scale test-time compute by iterating latent computations before emitting tokens. We study the geometry of these iterates and argue for a simple, two-scale operational picture: (i) within a looped block, updates act as small-scale refinements; (ii) across consecutive blocks, states undergo a larger-scale drift. Over the course of training, our measurements show that loop steps become smaller and increasingly orthogonal to one another, indicating better local modeling of fine structure rather than merely pushing in a single direction. These dynamics motivate an early-exit mechanism based on the second-order difference of the model's step sizes, which we show outperforms both the KL-divergence exit strategy of Geiping et al. and its naive first-order counterpart in performance, stability, and time efficiency.
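The two measurements behind this picture, step size and the angle between consecutive steps, can be computed from a recorded trajectory of loop states. The sketch below is illustrative only; the function name `loop_diagnostics`, the array layout, and the `eps` guard are assumptions, not details from the paper.

```python
import numpy as np

def loop_diagnostics(states, eps=1e-12):
    """Given loop iterates stacked as an array of shape (K + 1, d),
    return the step norms (length K) and the cosines between
    consecutive steps (length K - 1)."""
    states = np.asarray(states, dtype=float)
    deltas = np.diff(states, axis=0)        # row k holds one loop update
    norms = np.linalg.norm(deltas, axis=1)  # size of each update
    dots = np.sum(deltas[1:] * deltas[:-1], axis=1)
    # eps guards against division by zero when a step has zero length.
    cosines = dots / np.maximum(norms[1:] * norms[:-1], eps)
    return norms, cosines

# Toy trajectory: the second step is half as long and orthogonal to the first.
norms, cosines = loop_diagnostics([[0.0, 0.0], [1.0, 0.0], [1.0, 0.5]])
```

Shrinking norms together with cosines near zero correspond to the "smaller and increasingly orthogonal" loop steps described above.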
