MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue

TL;DR

MVISTA-4D tackles the gap between image-based world models and geometry-aware manipulation by learning a view-consistent 4D RGB-D world model that imagines futures across multiple viewpoints from a single-view observation. The method couples explicit cross-view and cross-modality fusion with a trajectory-level latent for actions, which enables test-time backpropagation to infer executable action sequences; a residual inverse dynamics model (IDM) then refines these sequences for execution. Empirical results on RLBench, RoboTwin, and a real-robot dataset show improved 4D scene generation and downstream manipulation, with ablations clarifying the importance of multi-view geometry, trajectory conditioning, and IDM residuals. The work advances embodied world modeling by delivering geometry-consistent futures that support robust action inference under occlusion and partial observability.
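
The test-time inference idea can be illustrated with a toy sketch. It is not the paper's implementation: the frozen "world model" here is a hypothetical linear map (`decode`), the latent is a short list of floats, and the gradient is written out analytically instead of being obtained by backpropagation through a generative network. The names `decode`, `infer_latent`, `steps`, and `lr` are all assumptions made for illustration.

```python
def decode(z, weights):
    """Toy stand-in for a frozen generative model: maps latent z to a predicted future."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in weights]


def infer_latent(target, weights, steps=500, lr=0.1):
    """Gradient descent on z minimising 0.5 * ||decode(z) - target||^2.

    Only z is updated; the model weights stay fixed, mirroring the idea of
    optimising a trajectory-level latent at test time.
    """
    z = [0.0] * len(weights[0])
    for _ in range(steps):
        residual = [p - t for p, t in zip(decode(z, weights), target)]
        # Analytic gradient for the linear toy model: W^T (W z - target).
        grad = [
            sum(weights[i][j] * residual[i] for i in range(len(weights)))
            for j in range(len(z))
        ]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z


# Hypothetical example: recover the latent that explains a given "future".
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
true_z = [0.5, -1.0]
future = decode(true_z, W)
z_hat = infer_latent(future, W)
```

In a real system the analytic gradient would be replaced by automatic differentiation through the generative model, and the recovered latent would still need to be turned into executable actions, which is the role the summary assigns to the residual inverse dynamics component.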

Abstract

World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.
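
The back-project-and-fuse step mentioned in the abstract can be sketched with the standard pinhole camera model. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`backproject`, `to_world`, `fuse_views`), the dictionary layout of a view, and the use of known camera-to-world poses are all hypothetical, and fusion is reduced to concatenating points in a shared world frame.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (rows of depth values) to camera-frame 3D points.

    Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as invalid and skip it
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def to_world(points, rotation, translation):
    """Apply a camera-to-world rigid transform: p_world = R * p_cam + t."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def fuse_views(views):
    """Concatenate the back-projected points of all views in the world frame."""
    cloud = []
    for view in views:
        cam_points = backproject(view["depth"], *view["intrinsics"])
        cloud.extend(to_world(cam_points, view["R"], view["t"]))
    return cloud


# Hypothetical usage: one 2x2 depth map with a single invalid pixel.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
view = {
    "depth": [[2.0, 0.0], [2.0, 2.0]],
    "intrinsics": (1.0, 1.0, 0.0, 0.0),  # fx, fy, cx, cy
    "R": identity,
    "t": [1.0, 0.0, 0.0],
}
cloud = fuse_views([view])
```

Repeating this per time step over the observed and imagined views would yield a sequence of point clouds, which is one simple way to read "3D structure across time"; the paper may use a different fusion procedure.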

Paper Structure

This paper contains 24 sections, 11 equations, 15 figures, and 9 tables.

Figures (15)

  • Figure 1: Overview of our main pipeline.
  • Figure 2: Qualitative results of 4D generation on the RoboTwin dataset. Red, green, and blue boxes represent different viewpoints.
  • Figure 3: Generated geometries on the real-world robot dataset.
  • Figure 4: Effect of geometry-aware cross-view modeling.
  • Figure 5: Effect of explicit cross-modality modeling.
  • ...and 10 more figures