FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution

Jingjing Fan; Yushan Liu; Shoujie Li; Botao Ren; Siyuan Li; Xiao-Ping Zhang; Wenbo Ding; Zhidong Deng

TL;DR

FUTURE-VLA tackles the latency and context bottlenecks of long-horizon robotic control by unifying perception and prediction through constrained spatiotemporal compression and latent-space autoregression. It introduces a dual-sided efficiency paradigm: Temporally Adaptive Cascaded Compression on the input side, and Spectral Action Tokenization plus Compact 1D Visual Tokenization on the output side. Together these enable joint generation of executable action chunks and look-ahead visual previews in a single forward pass. The approach supports a prediction-guided Human-in-the-Loop (HIL) mechanism through dynamic execution gating and resampling, improving safety and robustness across diverse tasks. Empirical results on LIBERO, RoboTwin, and a real-world Piper platform show up to a $16\times$ extension of the spatiotemporal window at single-frame latency, establishing strong baselines for embodied intelligence in real-time settings.
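To make "Spectral Action Tokenization" concrete, the following is a minimal sketch of one spectral scheme: a discrete cosine transform (DCT) over a single action dimension of a chunk, followed by integer quantization. It is an illustration under assumptions, not the paper's tokenizer: the function names, the `scale` value, and the omission of any subsequent compression stage are all hypothetical choices made for this example.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1D sequence (pure Python, O(n^2))."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out

def idct2(c):
    """Inverse of dct2 (the transpose of the orthonormal DCT-II matrix)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def encode_chunk(actions, scale=10.0):
    """Map one action dimension of a chunk to integer spectral tokens.

    `scale` is a hypothetical quantization resolution, not a value from the paper.
    """
    return [round(scale * c) for c in dct2(actions)]

def decode_chunk(tokens, scale=10.0):
    """Reconstruct an approximate action sequence from integer tokens."""
    return idct2([q / scale for q in tokens])
```

Because the transform is orthonormal, each coefficient's rounding error of at most `0.5 / scale` bounds the reconstruction error directly, so a coarser `scale` trades accuracy for smaller integers.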

Abstract

General vision-language models increasingly support unified spatiotemporal reasoning over long video streams, yet deploying such capabilities on robots remains constrained by the prohibitive latency of processing long-horizon histories and generating high-dimensional future predictions. To bridge this gap, we present FUTURE-VLA, a unified architecture that reformulates long-horizon control and future forecasting as a monolithic sequence-generation task. Adopting a dual-sided efficiency paradigm, FUTURE-VLA leverages a temporally adaptive compression strategy to maximize spatiotemporal information density, enabling the ingestion of extensive multi-view histories while maintaining constant inference latency. Simultaneously, it performs latent-space autoregression to align actionable dynamics with reviewable visual look-aheads in a single forward pass. These real-time predictive capabilities further enable a prediction-guided Human-In-the-Loop mechanism via interactive execution gating, allowing operators to dynamically validate behaviors based on interpretable future previews. Extensive evaluations demonstrate that FUTURE-VLA establishes new state-of-the-art performance, attaining success rates of 99.2% on LIBERO, 75.4% on RoboTwin, and 78.0% on a real-world Piper platform, all with a $16\times$ extended spatiotemporal window while maintaining the inference latency of a single-frame baseline.
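The constant-latency claim rests on keeping the total number of input tokens fixed while the history grows. The sketch below shows one way such a budget could be split across frames, giving recent frames more tokens than older ones. This is a hypothetical recency-weighted allocation standing in for the paper's compression strategy, whose actual schedule is not given in this text; the `decay` factor and the 256-token budget are assumptions chosen only for illustration.

```python
def allocate_tokens(num_frames, budget, decay=0.8):
    """Split a fixed token budget across frames, favoring recent ones.

    Index 0 is the newest frame. Every frame keeps at least one token.
    `decay` is a hypothetical per-step falloff, not a value from the paper.
    """
    if budget < num_frames:
        raise ValueError("budget must allow at least one token per frame")
    weights = [decay ** age for age in range(num_frames)]
    total = sum(weights)
    spare = budget - num_frames  # tokens left after reserving one per frame
    alloc = [1 + int(spare * w / total) for w in weights]
    # Tokens lost to flooring go to the newest frames first.
    leftover = budget - sum(alloc)
    for i in range(leftover):
        alloc[i % num_frames] += 1
    return alloc

# A 16-frame history squeezed into a (hypothetical) 256-token budget.
schedule = allocate_tokens(num_frames=16, budget=256)
```

Under this scheme the sequence length seen by the backbone is the same for 1 frame or 16, which is the property that would keep latency flat as the window extends.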

Paper Structure (43 sections, 4 equations, 18 figures, 5 tables)

Introduction
Related Works
Visual-Language-Action
World Models for Robotics
Method
Model Architecture
Visual Encoding & Cascaded Compression
Unified Tokenization
Spectral Action Tokenization
Compact 1D Visual Tokenization
Inference: Predictive Look-ahead
Experiments
Benchmarks and Experimental Setup
Simulation Benchmarks
Real-world Experiments
...and 28 more sections

Figures (18)

Figure 1: Comparison of VLA-WM Architectures. (a) Modular Fragmentation: Independent VLA and World Model operating with decoupled representations. (b) Instantaneous Unification: A unified framework integrating perception and prediction within a short-horizon temporal window. (c) FUTURE-VLA (Ours): A spatiotemporally unified architecture that synchronously generates action chunks and future previews via latent-space autoregression in a single forward pass.
Figure 2: The architecture of FUTURE-VLA. On the input side, multi-view historical observations are encoded via a frozen DINOv3 encoder and processed through Temporally Adaptive Cascaded Compression to maximize information density under a fixed token budget. On the output side, the model autoregressively generates action chunks (via FAST spectral tokenization) and future visual predictions (via compact 1D tokenization with 32 tokens per frame) in a single forward pass.
Figure 3: HIL Closed-Loop Execution with Resampling. FUTURE-VLA jointly generates action chunks and future visual previews, enabling a verifier to determine a safe execution horizon $k$ via Dynamic Gating or reject erroneous proposals ($k{=}0$) and trigger Resampling Recovery with increased temperature to escape deadlocks.
Figure 4: Multi-view rollout visualization in LIBERO and RoboTwin. Top: executed rollout keyframes (ground truth). Bottom: predicted future frames (time left to right).
Figure 5: Multi-view rollout visualization in the real-world setup. Top: executed rollout keyframes (ground truth). Bottom: predicted future frames (time left to right).
...and 13 more figures
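The closed loop described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: `propose` and `verify` are hypothetical callables standing in for the model and the verifier, and the linear temperature schedule and retry limit are invented for the example.

```python
def gated_step(propose, verify, base_temp=1.0, temp_step=0.5, max_retries=3):
    """Run one gated execution step with resampling on rejection.

    propose(temp) -> (actions, previews): hypothetical model call that returns
        an action chunk and its predicted future previews.
    verify(actions, previews) -> int: hypothetical verifier returning the safe
        execution horizon k; k == 0 rejects the whole proposal.
    Returns (actions_to_execute, temperature_of_last_proposal).
    """
    temp = base_temp
    for attempt in range(max_retries + 1):
        # Raise the sampling temperature on each retry to diversify proposals.
        temp = base_temp + attempt * temp_step
        actions, previews = propose(temp)
        k = max(0, min(verify(actions, previews), len(actions)))
        if k > 0:
            return actions[:k], temp  # execute only the verified prefix
    return [], temp  # every proposal was rejected; execute nothing
```

Returning an empty list when all retries fail leaves the fallback policy (for example, handing control to the operator) to the caller rather than executing an unverified chunk.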
