Two-Scale Latent Dynamics for Recurrent-Depth Transformers

Francesco Pappone; Donato Crisostomi; Emanuele Rodolà

TL;DR

This work analyzes test-time compute in recurrent-depth transformers and uncovers a two-scale geometry in the latent dynamics: small, curved refinements within looped blocks and a slower drift across blocks. Tracking iterate-level diagnostics such as the step norm $\|\Delta^{(k)}\|_2$ and the consecutive-step angle $\cos\angle(\Delta^{(k)}, \Delta^{(k-1)})$ shows that loop updates rapidly shrink and rotate away from their previous directions, while cross-block updates accumulate more gradually. The authors introduce a decoding-free, second-order exit rule based on the acceleration $a^{(k)} = \|\Delta^{(k)} - \Delta^{(k-1)}\|_2$ with a two-hit check, which yields better latency-quality trade-offs than KL-based or first-order step-norm exits. Across three recurrent regions in a GPT-2-style model, this geometry-guided controller achieves lower latency without sacrificing performance, offering a practical approach to adaptive latent-depth compute in language models.
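The two diagnostics named above can be illustrated with a minimal sketch. It assumes the loop iterates are available as plain lists of floats and that $\Delta^{(k)}$ is the difference between consecutive iterates; the helper names (`loop_diagnostics`, `cosine`) and the toy trajectory are hypothetical and not taken from the paper.

```python
import math

def norm(u):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(a * a for a in u))

def cosine(u, v):
    """Cosine of the angle between u and v; returns 0.0 if either is zero."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def loop_diagnostics(states):
    """Given iterates s^(0), ..., s^(K), return (step_norms, cosines).

    Assumption: Delta^(k) = s^(k) - s^(k-1).
    step_norms[k-1] = ||Delta^(k)||_2 for k = 1..K
    cosines[k-2]    = cos angle(Delta^(k), Delta^(k-1)) for k = 2..K
    """
    deltas = [
        [a - b for a, b in zip(states[k], states[k - 1])]
        for k in range(1, len(states))
    ]
    step_norms = [norm(d) for d in deltas]
    cosines = [cosine(deltas[k], deltas[k - 1]) for k in range(1, len(deltas))]
    return step_norms, cosines

# Toy trajectory (made up): each step halves in length and turns by 90 degrees.
states = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.5], [0.75, 0.5]]
step_norms, cosines = loop_diagnostics(states)
print(step_norms)
print(cosines)
```

On this toy trajectory the step norms shrink while consecutive steps are orthogonal, which is the qualitative pattern the text describes for loop updates.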

Abstract

Recurrent-depth transformers scale test-time compute by iterating latent computations before emitting tokens. We study the geometry of these iterates and argue for a simple, two-scale operational picture: (i) within a looped block, updates act as small-scale refinements; (ii) across consecutive blocks, states undergo a larger-scale drift. Across training, our measurements show that loop steps become smaller and increasingly orthogonal to one another, indicating better local modeling of fine structure rather than merely pushing in a single direction. These dynamics motivate an early-exit mechanism based on the second-order difference of the model's step sizes, which we show outperforms the KL-divergence exit strategy of Geiping et al. and its naive first-order counterpart in performance, stability, and time efficiency.
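A second-order exit of this kind might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the "two-hit check" is interpreted here as requiring the acceleration $\|\Delta^{(k)} - \Delta^{(k-1)}\|_2$ to fall below a threshold on two consecutive steps, and `step_fn`, `tau`, `max_steps`, and `hits_required` are hypothetical names introduced for the example.

```python
import math

def l2_diff(u, v):
    """Euclidean norm of u - v for vectors given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def acceleration_exit(step_fn, s0, tau, max_steps, hits_required=2):
    """Iterate step_fn from s0 and stop early on small acceleration.

    Assumption: exit once ||Delta^(k) - Delta^(k-1)||_2 < tau has held on
    `hits_required` consecutive steps. Returns (final_state, steps_taken).
    No decoding to logits is needed; only latent states are compared.
    """
    prev_state, prev_delta, hits = s0, None, 0
    for k in range(1, max_steps + 1):
        state = step_fn(prev_state)
        delta = [a - b for a, b in zip(state, prev_state)]
        if prev_delta is not None:
            accel = l2_diff(delta, prev_delta)
            # A miss resets the counter, so the hits must be consecutive.
            hits = hits + 1 if accel < tau else 0
            if hits >= hits_required:
                return state, k
        prev_state, prev_delta = state, delta
    return prev_state, max_steps

# Toy stand-in for a looped block: a contraction that halves the state.
def halve(state):
    return [0.5 * x for x in state]

final_state, steps = acceleration_exit(halve, [1.0], tau=0.1, max_steps=32)
print(final_state, steps)
```

Requiring two consecutive hits is one way to avoid stopping on a single, transiently small acceleration; a first-order variant would instead compare the step norm of `delta` against the threshold.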
