Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Yalcin Tur; Jalal Naghiyev; Haoquan Fang; Wei-Chuan Tsai; Jiafei Duan; Dieter Fox; Ranjay Krishna

TL;DR

RD-VLA uses latent iterative reasoning to decouple test-time compute from fixed architectural depth in vision-language-action models. A weight-tied recurrent core refines a latent scratchpad across $r$ iterations, and an adaptive stopping rule, guided by convergence of the predicted actions, ends refinement early, so compute scales with task difficulty while memory stays constant. The method yields state-of-the-art results on the LIBERO and CALVIN benchmarks and demonstrates strong real-world robustness on a bimanual manipulator, with adaptive strategies reducing average compute while maintaining performance. This latent-space reasoning paradigm offers practical inference speedups and a framework for uncertainty-aware adaptive execution in embodied AI systems.

Abstract

Current Vision-Language-Action (VLA) models rely on fixed computational depth, expending the same amount of compute on simple adjustments and complex multi-step manipulation. While Chain-of-Thought (CoT) prompting enables variable computation, it scales memory linearly and is ill-suited for continuous action spaces. We introduce Recurrent-Depth VLA (RD-VLA), an architecture that achieves computational adaptivity via latent iterative refinement rather than explicit token generation. RD-VLA employs a recurrent, weight-tied action head that supports arbitrary inference depth with a constant memory footprint. The model is trained using truncated backpropagation through time (TBPTT) to efficiently supervise the refinement process. At inference, RD-VLA dynamically allocates compute using an adaptive stopping criterion based on latent convergence. Experiments on challenging manipulation tasks show that recurrent depth is critical: tasks that fail entirely (0 percent success) with single-iteration inference exceed 90 percent success with four iterations, while simpler tasks saturate rapidly. RD-VLA provides a scalable path to test-time compute in robotics, replacing token-based reasoning with latent reasoning to achieve constant memory usage and up to 80x inference speedup over prior reasoning-based VLA models. Project page: https://rd-vla.github.io/
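To make the idea of weight-tied refinement with convergence-based stopping concrete, here is a minimal toy sketch. It is not the RD-VLA implementation: `make_core`, `refine`, the `tanh` update, the random weights, and the L2 stopping rule on the latent state are all assumptions chosen for illustration. The sketch only shows that one shared update can be applied a variable number of times while keeping a single latent state in memory.

```python
import math
import random


def make_core(dim, seed=0, scale=0.5):
    """Build a toy weight-tied update s <- tanh(W s + c).

    Hypothetical stand-in for a recurrent core: the same W is reused at
    every iteration. Row sums of |W| are at most `scale` < 1, so the map
    is a contraction and repeated application converges.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(-scale, scale) / dim for _ in range(dim)]
         for _ in range(dim)]

    def core(state, conditioning):
        return [math.tanh(sum(w * s for w, s in zip(row, state)) + c)
                for row, c in zip(W, conditioning)]

    return core


def refine(core, conditioning, init_state, max_iters=32, tol=1e-4):
    """Iterate the shared core until the latent state stops changing.

    Only the current state is stored, so memory does not grow with the
    number of iterations. The stopping rule (L2 change below `tol`) is an
    assumed illustration of a convergence-based criterion.
    """
    state = list(init_state)
    iters = 0
    for iters in range(1, max_iters + 1):
        new_state = core(state, conditioning)
        delta = math.sqrt(sum((a - b) ** 2
                              for a, b in zip(new_state, state)))
        state = new_state
        if delta < tol:
            break
    return state, iters


if __name__ == "__main__":
    rng = random.Random(1)
    dim = 8
    core = make_core(dim)
    conditioning = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for observation features
    scratchpad = [rng.gauss(0, 1) for _ in range(dim)]    # noisy initial latent state
    _, used = refine(core, conditioning, scratchpad)
    print("iterations used:", used)
```

In this toy setting, a looser tolerance stops earlier and a tighter one runs longer. That mirrors the trade-off between compute and convergence that adaptive stopping exposes.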

Paper Structure (23 sections, 9 equations, 7 figures, 3 tables)

Introduction
Related Work
Vision-Language-Action Models
Reasoning and Efficient-Compute VLA Models
Recurrent Transformers
Method
Architectural Backbone and Token Flow
Recurrent-Depth Architecture
Latent Iterative Reasoning via Input Injection
Coda and Action Projection
Training with Randomized Recurrence
Adaptive Computation
Adaptive Execution
Threshold-Based Adaptive Execution
Linear Decay Execution
...and 8 more sections

Figures (7)

Figure 1: Recurrent-Depth VLA. (Left) Previous reasoning VLAs (e.g., ThinkAct, MolmoAct) generate explicit reasoning tokens in output space, requiring expensive autoregressive decoding. (Center) Our approach performs iterative refinement entirely in latent representation space, bypassing token generation overhead. (Right) RD-VLA achieves comparable performance to autoregressive reasoning baselines on LIBERO-10 while being substantially faster due to the efficiency of latent reasoning with adaptive compute.
Figure 2: Recurrent-Depth VLA Architecture. The Prelude (P) grounds learned queries via cross-attention to mid-layer VLM features. The weight-tied Recurrent Core (R) iteratively refines a noisy latent scratchpad over $K$ iterations, cross-attending to final-layer VLM representations and proprioception. The Coda (C) decodes the converged state into actions. Recurrence depth $K$ adapts dynamically at inference based on task complexity.
Figure 3: Case study for adaptive computation. In a LIBERO rollout, the model dynamically selects different numbers of iterations before terminating, depending on the execution state. It uses fewer iterations (7–9) at steps 1 and 30, which correspond to simpler motions like navigation and placing, and more iterations (about 14) at steps 10 and 25, where the actions are more complex, such as grasping.
Figure 4: Performance across LIBERO benchmarks for different numbers of recurrences. All task categories show consistent improvement with increased computational depth, with models converging between 8–12 iterations on average.
Figure 5: Performance on five selected Long tasks across recurrence steps. Each task exhibits distinct convergence behavior: Task 4 jumps from 6% at iteration 1 to nearly 80% at iteration 2, while Task 5 remains at 0% through iteration 2 and only reaches $\sim$70% at iteration 3. This demonstrates the task-dependent, emergent adaptive behavior of our model.
...and 2 more figures
