Interactive World Simulator for Robot Policy Training and Evaluation

Yixuan Wang; Rhythm Syed; Fangyu Wu; Mengchao Zhang; Aykut Onol; Jose Barreiros; Hooshang Nayyeri; Tony Dear; Huan Zhang; Yunzhu Li

TL;DR

Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, this work establishes the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.

Abstract

Action-conditioned video prediction models (often referred to as world models) have shown strong potential for robotics applications, but existing approaches are often slow and struggle to capture physically consistent interactions over long horizons, limiting their usefulness for scalable robot policy training and evaluation. We present Interactive World Simulator, a framework for building interactive world models from a moderate-sized robot interaction dataset. Our approach leverages consistency models for both image decoding and latent-space dynamics prediction, enabling fast and stable simulation of physical interactions. In our experiments, the learned world models produce interaction-consistent pixel-level predictions and support stable long-horizon interactions for more than 10 minutes at 15 FPS on a single RTX 4090 GPU. Our framework enables scalable demonstration collection solely within the world models to train state-of-the-art imitation policies. Through extensive real-world evaluation across diverse tasks involving rigid objects, deformable objects, object piles, and their interactions, we find that policies trained on world-model-generated data perform comparably to those trained on the same amount of real-world data. Additionally, we evaluate policies both within the world models and in the real world across diverse tasks, and observe a strong correlation between simulated and real-world performance. Together, these results establish the Interactive World Simulator as a stable and physically consistent surrogate for scalable robotic data generation and faithful, reproducible policy evaluation.
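The "strong correlation between simulated and real-world performance" can be made concrete with a small sketch. The abstract does not say which statistic is used, so Pearson's r is an assumption here, and the scores below are hypothetical placeholders rather than results from the paper. The function name `pearson_r` is also illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalized; the 1/n factors cancel in r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder scores, NOT numbers from the paper:
# one entry per policy checkpoint, evaluated in the simulator and on the robot.
sim_scores = [0.2, 0.5, 0.7, 0.9]
real_scores = [0.25, 0.45, 0.75, 0.85]

r = pearson_r(sim_scores, real_scores)
print(f"sim-vs-real Pearson r = {r:.3f}")
```

A value of r close to 1 would mean that ranking policies inside the world model tends to match their ranking on the real robot, which is the property that makes the simulator useful for evaluation.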

Paper Structure (19 sections, 5 equations, 7 figures, 1 table)

This paper contains 19 sections, 5 equations, 7 figures, 1 table.

Introduction
Related Works
Video Prediction Model for Robotic Manipulation
Imitation Policy Training
Imitation Policy Evaluation
Method
Problem Formulation
Interactive World Simulator
Stage 1: Autoencoder Training
Stage 2: Dynamics Training
Inference
Data Generation for Policy Training
Policy Evaluation
Experiment
World Model Instantiation
...and 4 more sections

Figures (7)

Figure 1: Overview of the Interactive World Simulator. Given real-world robot interaction datasets (left), we train action-conditioned video models that capture complex physical dynamics and support stable long-horizon interactions (middle). The resulting world models achieve stable video prediction for over 10 minutes at 15 FPS on a single RTX 4090 GPU, outperforming prior world models in long-horizon realism. The Interactive World Simulator enables scalable data generation for imitation policy training without requiring additional robot interaction and supports faithful and reproducible policy evaluation, exhibiting a strong correlation between policy performance in simulation and in the real world (right). Additional examples are available at https://www.yixuanwang.me/interactive_world_sim/.
Figure 2: Method Overview. Model training proceeds in two stages. In Stage 1, we train an autoencoder that maps RGB observations into compact 2D latent representations using a CNN encoder $E_{\phi}$ and a consistency-model decoder $D_{\theta}$, enabling efficient and high-fidelity reconstruction. In Stage 2, we freeze the autoencoder and train an action-conditioned consistency model $F_{\psi}$ in latent space to model future visual dynamics using next-frame supervision. We choose $F_{\psi}$ as a consistency model because it efficiently represents multimodal distributions over possible future outcomes. The dynamics model learns to denoise noisy latent states conditioned on robot actions, past latents, and noise levels, and is instantiated as a stack of 3D convolutional blocks with FiLM modulation and spatiotemporal attention. During inference, $F_{\psi}$ autoregressively predicts future latents, which are decoded by $D_{\theta}$ to generate long-horizon video predictions.
Figure 3: Qualitative Comparison of Long-Horizon Action-Conditioned Video Prediction. We compare the proposed Interactive World Simulator with state-of-the-art action-conditioned video prediction models across multiple manipulation tasks. (a) Intermediate predicted frames for the Pile Sweeping task illustrate long-horizon rollout behavior, where baseline methods exhibit inaccurate dynamics, robot pose drift, accumulated artifacts, or loss of fine-grained details over time. (b) Terminal predicted frames for additional tasks, including Box Packing, Rope Routing, T Pushing, Rope Collecting, and Mug Grasping, highlight differences in visual fidelity and interaction consistency. In contrast to baseline methods, our model preserves coherent robot–object interactions and maintains stable predictions over extended horizons.
Figure 4: Teleoperation Interface for Real-Time Interaction with the World Simulator. We build a kinematic teleoperation device that allows users to control robot actions through the Interactive World Simulator. The left column shows a user issuing commands through the teleoperation device (e.g., moving forward, opening the gripper, and lifting the end-effector). The right columns show the corresponding frames generated by the world simulator in real time. The simulator follows the teleoperation commands and produces consistent robot–object interactions.
Figure 5: Imitation Policy Training with Mixtures of Real-World Data and Our World Simulator Data. "WS %" represents the percentage of data from our world simulator, and "Real %" represents the percentage of data from the real world. From left to right, the mixtures range from 100% world simulator data to 100% real-world data. We use a total of 100 training episodes for each policy and data mixture. Across manipulation tasks, policies trained on the different data mixtures achieve consistently high and comparable task scores, suggesting that data generated by our world simulator is comparable in quality to real-world demonstrations.
...and 2 more figures
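The inference procedure described in the Figure 2 caption (encode once, predict latents autoregressively, decode each latent into a frame) can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `rollout`, `history_len`, and the three `toy_*` functions are hypothetical stand-ins for $E_{\phi}$, $F_{\psi}$, and $D_{\theta}$, and the noise-level sampling of the consistency models is omitted entirely.

```python
def rollout(encode, dynamics, decode, first_frame, actions, history_len=4):
    """Autoregressively predict one decoded frame per action.

    encode, dynamics, decode are placeholders for the encoder, the
    action-conditioned latent dynamics model, and the decoder.
    history_len (number of past latents fed to the dynamics) is an assumed value.
    """
    latents = [encode(first_frame)]
    frames = []
    for action in actions:
        context = latents[-history_len:]          # most recent latents only
        next_latent = dynamics(context, action)   # predict the next latent
        latents.append(next_latent)               # feed the prediction back in
        frames.append(decode(next_latent))        # map the latent to pixels
    return frames


# Toy stand-ins so the loop runs end to end (NOT the real networks).
def toy_encode(frame):
    return list(frame)


def toy_dynamics(context, action):
    return [value + action for value in context[-1]]


def toy_decode(latent):
    return list(latent)


frames = rollout(toy_encode, toy_dynamics, toy_decode, [0.0] * 4, [1.0] * 5)

# Rollout length implied by the reported numbers: 10 minutes at 15 FPS.
fps, minutes = 15, 10
num_steps = fps * 60 * minutes
print(len(frames), num_steps)
```

Because each predicted latent becomes part of the context for the next step, small errors can compound; a 10-minute rollout at 15 FPS amounts to 9000 consecutive predictions, which is why the captions stress long-horizon stability.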
