MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis

Xiaowei Chi, Kuangzhi Ge, Jiaming Liu, Siyuan Zhou, Peidong Jia, Zichen He, Yuzhen Liu, Tingguang Li, Lei Han, Sirui Han, Shanghang Zhang, Yike Guo

TL;DR

MinD tackles real-time planning and implicit risk analysis by learning a dual diffusion-based world model that imagines future states at low frequency (LoDiff) while guiding high-frequency control (HiDiff). The DiffMatcher co-training aligns intermediate latent representations across asynchronous diffusion processes with a diffusion-forcing loss, enabling single-step latent conditioning for fast, coherent actions. MinD achieves state-of-the-art results on RLBench ($63\%$) and real-world Franka tasks ($60\%$), runs at $11.3$ FPS, and provides early failure signals by analyzing latent features ($74\%$ of potential task failures identified in advance). This work introduces a practical framework for safe, explainable robotic manipulation by combining efficient latent imagination with reactive control.

Abstract

Video Generation Models (VGMs) have become powerful backbones for Vision-Language-Action (VLA) models, leveraging large-scale pretraining for robust dynamics modeling. However, current methods underutilize their distribution modeling capabilities for predicting future states. Two challenges hinder progress: integrating generative processes into feature learning is both technically and conceptually underdeveloped, and naive frame-by-frame video diffusion is computationally inefficient for real-time robotics. To address these, we propose Manipulate in Dream (MinD), a dual-system world model for real-time, risk-aware planning. MinD uses two asynchronous diffusion processes: a low-frequency visual generator (LoDiff) that predicts future scenes and a high-frequency diffusion policy (HiDiff) that outputs actions. Our key insight is that robotic policies do not require fully denoised frames but can rely on low-resolution latents generated in a single denoising step. To connect early predictions to actions, we introduce DiffMatcher, a video-action alignment module with a novel co-training strategy that synchronizes the two diffusion models. MinD achieves a 63% success rate on RLBench, 60% on real-world Franka tasks, and operates at 11.3 FPS, demonstrating the efficiency of single-step latent features for control signals. Furthermore, MinD identifies 74% of potential task failures in advance, providing real-time safety signals for monitoring and intervention. This work establishes a new paradigm for efficient and reliable robotic manipulation using generative world models.
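
The asynchronous scheduling described in the abstract can be illustrated with a minimal sketch: a slow model refreshes a latent only every few control steps, while a fast policy acts at every step using the most recent latent. Everything named here is a hypothetical placeholder rather than the paper's implementation: `lodiff_single_step`, `hidiff_policy`, `DualRateController`, the latent and action sizes, and the 4:1 step ratio are assumptions, and the two stand-in functions are toy arithmetic, not diffusion models.

```python
import random
from dataclasses import dataclass, field
from typing import List

LATENT_DIM = 8     # hypothetical size of the visual latent
ACTION_DIM = 7     # hypothetical action size (e.g. arm pose delta plus gripper)
REPLAN_EVERY = 4   # hypothetical number of fast steps per slow step


def lodiff_single_step(observation: List[float], rng: random.Random) -> List[float]:
    """Toy stand-in for the slow visual model: one update from noise to a coarse latent."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    obs_mean = sum(observation) / len(observation)
    # A single update moves the noise partway toward an observation-dependent target.
    return [0.5 * n + 0.5 * obs_mean for n in noise]


def hidiff_policy(latent: List[float], observation: List[float]) -> List[float]:
    """Toy stand-in for the fast policy: maps cached latent and observation to an action."""
    context = sum(latent) / len(latent) + observation[0]
    return [context / (i + 1) for i in range(ACTION_DIM)]


@dataclass
class DualRateController:
    seed: int = 0
    slow_calls: int = 0
    fast_calls: int = 0
    _latent: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._rng = random.Random(self.seed)

    def step(self, t: int, observation: List[float]) -> List[float]:
        # Refresh the latent only on every REPLAN_EVERY-th step (low frequency).
        if t % REPLAN_EVERY == 0:
            self._latent = lodiff_single_step(observation, self._rng)
            self.slow_calls += 1
        # The policy runs on every step (high frequency) using the cached latent.
        self.fast_calls += 1
        return hidiff_policy(self._latent, observation)


def run_episode(num_steps: int = 12) -> DualRateController:
    controller = DualRateController(seed=0)
    for t in range(num_steps):
        observation = [0.1 * t, 0.0, 1.0]  # dummy observation
        action = controller.step(t, observation)
        assert len(action) == ACTION_DIM
    return controller
```

Calling `run_episode(12)` invokes the slow stand-in at steps 0, 4, and 8 and the fast stand-in at all 12 steps, which is the sense in which the expensive visual model is amortized across several control steps.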

Paper Structure

This paper contains 59 sections, 10 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Decoded frames from 50 DDIM steps vs. a single DDPM step. Instead of generating full video frames, extracting latent features from a single diffusion step provides a semantically rich and computationally efficient representation. This single-step representation is not only fast to generate but also sufficiently informative to support downstream tasks such as VLA policy execution and failure prediction.
  • Figure 2: We present Manipulate in Dream (MinD): a video-action unified generation world model that can manipulate, imagine, and simulate. MinD integrates LoDiff-Visual for low-frequency video generation and HiDiff-Policy for high-frequency action planning. A dynamic feature adapter, DiffMatcher, bridges motion features between the two systems, ensuring consistency across video and action.
  • Figure 3: MinD framework overview. The MinD framework comprises three core components: (i) the "slow" LoDiff-Visual, which produces future visual latents for long-latency scenes; (ii) the "fast" HiDiff-Policy, which outputs high-frequency actions; and (iii) the DiffMatcher module, which bridges the visual and action modalities. The first two form a pair of asynchronous diffusion models. During training, a co-training strategy uses a diffusion-forcing loss so that DiffMatcher learns mappings that are robust to different noise levels. We first pretrain the foundation model and the adapter on multiple robot pretraining datasets and then finetune the model on downstream tasks, including RLBench simulation and a real-world Franka robot.
  • Figure 4: This figure showcases the consistency between the future imagined by our LoDiff video generator (bottom rows) and the final trajectory executed by the HiDiff policy (top rows). This high-fidelity alignment across both RLBench simulation and real-world Franka tasks validates MinD's capability as an effective world model.
  • Figure 5: Evaluation of video generation predictions. The left panel visualizes failing cases (left) with misaligned generated video clips and corresponding successful cases (right) with accurate predictions. The middle panel shows the confusion matrix of our human evaluation. The right panel showcases the PCA result of the single-step predictive visual feature.
  • ...and 8 more figures