MinD: Learning A Dual-System World Model for Real-Time Planning and Implicit Risk Analysis
Xiaowei Chi, Kuangzhi Ge, Jiaming Liu, Siyuan Zhou, Peidong Jia, Zichen He, Yuzhen Liu, Tingguang Li, Lei Han, Sirui Han, Shanghang Zhang, Yike Guo
TL;DR
MinD tackles real-time planning and implicit risk analysis by learning a dual diffusion-based world model that imagines future states at low frequency (LoDiff) while guiding high-frequency control (HiDiff). DiffMatcher co-training aligns intermediate latent representations across the two asynchronous diffusion processes with a diffusion-forcing loss, enabling single-step latent conditioning for fast, coherent actions. MinD achieves state-of-the-art success rates on RLBench ($63\%$) and real-world Franka tasks ($60\%$), runs at $11.3$ FPS, and provides early failure signals by analyzing latent features, identifying $74\%$ of potential task failures in advance. This work introduces a practical framework for safe, explainable robotic manipulation by combining efficient latent imagination with reactive control.
Abstract
Video Generation Models (VGMs) have become powerful backbones for Vision-Language-Action (VLA) models, leveraging large-scale pretraining for robust dynamics modeling. However, current methods underutilize their distribution modeling capabilities for predicting future states. Two challenges hinder progress: integrating generative processes into feature learning is both technically and conceptually underdeveloped, and naive frame-by-frame video diffusion is computationally inefficient for real-time robotics. To address these challenges, we propose Manipulate in Dream (MinD), a dual-system world model for real-time, risk-aware planning. MinD uses two asynchronous diffusion processes: a low-frequency visual generator (LoDiff) that predicts future scenes and a high-frequency diffusion policy (HiDiff) that outputs actions. Our key insight is that robotic policies do not require fully denoised frames but can rely on low-resolution latents generated in a single denoising step. To connect early predictions to actions, we introduce DiffMatcher, a video-action alignment module with a novel co-training strategy that synchronizes the two diffusion models. MinD achieves a 63% success rate on RLBench and 60% on real-world Franka tasks, and operates at 11.3 FPS, demonstrating the efficiency of single-step latent features for control signals. Furthermore, MinD identifies 74% of potential task failures in advance, providing real-time safety signals for monitoring and intervention. This work establishes a new paradigm for efficient and reliable robotic manipulation using generative world models.
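To make the asynchronous dual-system idea concrete, the following is a minimal Python sketch of a dual-rate loop: a slow generator refreshes a coarse latent every few control steps, and a fast policy produces an action at every step conditioned on the most recent latent. It is only an illustration under stated assumptions, not the MinD implementation. The functions `lodiff_single_step` and `hidiff_action`, the parameter `replan_every`, and the toy arithmetic inside them are all hypothetical placeholders; the abstract does not specify the update rate, the latent shape, or how DiffMatcher maps latents to the policy.

```python
import random


def lodiff_single_step(observation, rng):
    """Hypothetical stand-in for LoDiff: one 'denoising' step to a coarse latent.

    The blend below is a toy placeholder, NOT a real diffusion update.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in observation]
    return [0.5 * n + 0.5 * o for n, o in zip(noise, observation)]


def hidiff_action(latent, observation):
    """Hypothetical stand-in for HiDiff: an action conditioned on the latest latent."""
    return [l - o for l, o in zip(latent, observation)]


def run_dual_rate_loop(num_steps, replan_every, seed=0):
    """Run the fast policy every step and the slow generator every `replan_every` steps."""
    rng = random.Random(seed)
    observation = [0.0, 0.0, 0.0]  # toy 3-D state (assumption)
    latent = None
    lodiff_calls = 0
    actions = []
    for t in range(num_steps):
        # Low-frequency branch: refresh the imagined latent only occasionally.
        if t % replan_every == 0:
            latent = lodiff_single_step(observation, rng)
            lodiff_calls += 1
        # High-frequency branch: act at every step using the cached latent.
        action = hidiff_action(latent, observation)
        actions.append(action)
        # Toy environment transition (assumption): nudge the state by the action.
        observation = [o + 0.1 * a for o, a in zip(observation, action)]
    return actions, lodiff_calls


if __name__ == "__main__":
    actions, lodiff_calls = run_dual_rate_loop(num_steps=12, replan_every=4)
    print(len(actions), "actions from", lodiff_calls, "latent refreshes")
```

The point of the sketch is the scheduling pattern: the expensive generator runs once per `replan_every` control steps and stops after a single step, so the per-step cost is dominated by the lightweight policy.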
