Learning Primitive Embodied World Models: Towards Scalable Robotic Learning
Qiao Sun, Liujia Yang, Wei Tang, Wei Huang, Kaixin Xu, Yongchao Chen, Mingyu Liu, Jiange Yang, Haoyi Zhu, Yating Wang, Tong He, Yilun Chen, Xili Dai, Nanyang Ye, Qinying Gu
TL;DR
PEWM introduces Primitive Embodied World Models to address data sparsity and the difficulty of long-horizon video prediction in embodied AI. By decomposing tasks into short, language-grounded primitives and leveraging a Vision-Language Model planner with Start-Goal heatmap guidance, the approach achieves fine-grained language–action alignment, data efficiency, and real-time control. The method combines a sim-real hybrid data strategy, a three-stage video-model fine-tuning pipeline, and causal distillation to produce real-time, controllable primitive rollouts, enabling zero-shot compositional generalization and scalable data synthesis for sim-to-real transfer. Empirical results on RLBench and real robot experiments show strong primitive-level generalization, robust long-horizon planning, and competitive video-generation quality with a small model, highlighting the practical potential for scalable embodied intelligence.
Abstract
While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling: Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.
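The closed-loop composition of fixed-horizon primitives described above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_planner` stands in for the VLM planner, `toy_rollout` stands in for the video world model (here it only interpolates a 2D point), and `HORIZON`, `Primitive`, and `run_closed_loop` are hypothetical names chosen for illustration. The sketch also assumes that the last predicted frame serves as the next observation, whereas a real system would re-observe the scene.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]
HORIZON = 8  # assumed fixed number of frames per primitive rollout


@dataclass(frozen=True)
class Primitive:
    instruction: str  # short language-grounded command
    start: Point      # start point (stand-in for the Start-Goal heatmap)
    goal: Point       # goal point (stand-in for the Start-Goal heatmap)


def toy_planner(task: List[Tuple[str, Point]], position: Point,
                done: int) -> Optional[Primitive]:
    """Stand-in for the VLM planner: emit the next primitive, or None when finished."""
    if done >= len(task):
        return None
    instruction, goal = task[done]
    return Primitive(instruction, start=position, goal=goal)


def toy_rollout(prim: Primitive, horizon: int = HORIZON) -> List[Point]:
    """Stand-in for the world model: interpolate start to goal over a fixed horizon."""
    (x0, y0), (x1, y1) = prim.start, prim.goal
    return [
        (x0 + (x1 - x0) * t / horizon, y0 + (y1 - y0) * t / horizon)
        for t in range(1, horizon + 1)
    ]


def run_closed_loop(task: List[Tuple[str, Point]], position: Point,
                    max_primitives: int = 10):
    """Re-plan after every primitive, treating the last frame as the new observation."""
    executed: List[Primitive] = []
    for _ in range(max_primitives):
        prim = toy_planner(task, position, len(executed))
        if prim is None:
            break
        frames = toy_rollout(prim)
        position = frames[-1]
        executed.append(prim)
    return executed, position


task = [("reach the cup", (1.0, 0.0)), ("lift the cup", (1.0, 1.0))]
executed, final_position = run_closed_loop(task, position=(0.0, 0.0))
```

Because every rollout has the same short length, a long task becomes a sequence of bounded-cost predictions, and the planner can revise the next primitive after each one.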
