FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model

Xiaoxu Xu, Hao Li, Jinhui Ye, Yilun Chen, Jia Zeng, Xinyi Chen, Linning Xu, Dahua Lin, Weixin Li, Jiangmiao Pang

TL;DR

FutureVLA extracts joint visuomotor embeddings by first decoupling visual and motor information and then jointly encoding generalized physical priors. A latent embedding alignment strategy then lets diverse downstream VLA models internalize these temporal priors without modifying their inference architectures.

Abstract

Predictive foresight is important to intelligent embodied agents. Since the motor execution of a robot is intrinsically constrained by its visual perception of environmental geometry, effectively anticipating the future requires capturing this tightly coupled visuomotor interplay. While recent vision-language-action models attempt to incorporate future guidance, they struggle with this joint modeling. Existing explicit methods divert capacity to task-irrelevant visual details, whereas implicit methods relying on sparse frame pairs disrupt temporal continuity. By heavily relying on visual reconstruction, these methods become visually dominated, entangling static scene context with dynamic action intent. We argue that effective joint visuomotor predictive modeling requires both temporal continuity and visually-conditioned supervision decoupling. To this end, we propose FutureVLA, featuring a novel Joint Visuomotor Predictive Architecture. FutureVLA is designed to extract joint visuomotor embeddings by first decoupling visual and motor information, and then jointly encoding generalized physical priors. Specifically, in the pretraining stage, we leverage heterogeneous manipulation datasets and introduce a Joint Visuomotor Gating mechanism to structurally separate visual state preservation from temporal action modeling. It allows the motor stream to focus on continuous physical dynamics while explicitly querying visual tokens for environmental constraints, yielding highly generalizable joint visuomotor embeddings. Subsequently, in the post-training stage, we employ a latent embeddings alignment strategy, enabling diverse downstream VLA models to internalize these temporal priors without modifying their inference architectures. Extensive experiments demonstrate that FutureVLA consistently improves VLA frameworks.
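
To make the post-training idea concrete, here is a minimal sketch of what a latent embedding alignment objective could look like. The abstract does not specify the loss, so the cosine-similarity form, the names `alignment_loss` and `total_loss`, and the `weight` parameter are all assumptions made for illustration, not the authors' formulation. The sketch assumes the downstream VLA's intermediate tokens have already been projected to the same dimension as the frozen model's joint visuomotor embeddings.

```python
import math

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def alignment_loss(student_tokens, teacher_tokens):
    """Hypothetical alignment term: 1 minus the mean cosine similarity.

    student_tokens: intermediate VLA representations (assumed already
                    projected to the teacher's dimension).
    teacher_tokens: embeddings from the frozen pretrained model.
    """
    sims = [cosine_similarity(s, t) for s, t in zip(student_tokens, teacher_tokens)]
    return 1.0 - sum(sims) / len(sims)

def total_loss(action_loss, student_tokens, teacher_tokens, weight=0.5):
    """Assumed combination: the usual action loss plus a weighted alignment term."""
    return action_loss + weight * alignment_loss(student_tokens, teacher_tokens)
```

Because the alignment term only appears in the training loss, the frozen model can be dropped at inference time, which is consistent with the claim that the downstream VLA's inference architecture is left unchanged.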

Paper Structure (18 sections, 10 equations, 10 figures, 15 tables, 1 algorithm)

Figures (10)

  • Figure 1: Comparison of future guidance paradigms for VLA models. (a) Explicit guidance predicts future video frames. (b) Implicit guidance learns latent vectors to reconstruct changes between sparsely sampled frames. (c) Ours processes continuous multi-frame clips and structurally decouples the latent representation into a visual stream and a motor stream. This visually-conditioned decoupling extracts joint visuomotor embeddings, yielding consistent performance improvements across diverse benchmarks.
  • Figure 2: Overview of the FutureVLA framework. (a) Joint Visuomotor Pretraining: Continuous video clips are processed by a frozen 3D-VAE into temporal tokens and structurally decoupled into two streams. Visual tokens reconstruct the initial frame $O_t$, while motor tokens, supervised by action chunks, utilize the Joint Visuomotor Gating module (c) based on gated cross-attention, where the motor stream iteratively queries spatial affordances from the visual tokens, yielding physically grounded joint visuomotor embeddings. (b) Joint Visuomotor Embedding Guided VLA Post-training: The frozen model provides joint visuomotor embeddings as future-aware temporal priors. Through latent embedding alignment, the downstream VLA's intermediate representations are forced to internalize these dynamics.
  • Figure 3: Visualization in (a) and performance comparison in (b) for four real-robot manipulation tasks. In (a), from top to bottom: (1) Make a burger: The robot first picks up the burger slice on the right and places it on the bread, then picks up the bread slice on the left and places it on top of the burger slice. (2) Insert roses into a pot: The robot picks up the roses from the table and inserts them into the pot. (3) Scoop beans with a spoon: The robot grasps the spoon in the center, scoops beans from the left container, and transfers them into the bowl on the right. (4) Erase handwriting on a whiteboard: The robot erases the black markings on the whiteboard.
  • Figure 4: Effect of different temporal sampling density on joint visuomotor learning, based on the WidowX robot.
  • Figure 5: Performance of joint visuomotor learning with varying temporal horizon length on the WidowX robot.
  • ...and 5 more figures
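
The Figure 2 caption describes a Joint Visuomotor Gating module based on gated cross-attention, in which motor tokens query the visual tokens. The following is a minimal single-layer sketch of gated cross-attention in general, not the paper's module: the tanh-scaled residual gate, the absence of learned projections, and the names `cross_attention` and `gated_cross_attention` are assumptions chosen to keep the example small.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out

def gated_cross_attention(motor_tokens, visual_tokens, gate):
    """Motor tokens query visual tokens; a scalar gate scales the update.

    Assumed form: out = motor + tanh(gate) * Attention(motor, visual, visual).
    With gate = 0 the motor stream passes through unchanged.
    """
    attended = cross_attention(motor_tokens, visual_tokens, visual_tokens)
    g = math.tanh(gate)
    return [
        [m + g * a for m, a in zip(motor_vec, attended_vec)]
        for motor_vec, attended_vec in zip(motor_tokens, attended)
    ]
```

In this form the motor tokens act as queries and the visual tokens serve as both keys and values, so visual information enters the motor stream only through the gated residual. Stacking several such layers would correspond to the iterative querying mentioned in the caption.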