Learning Primitive Embodied World Models: Towards Scalable Robotic Learning

Qiao Sun, Liujia Yang, Wei Tang, Wei Huang, Kaixin Xu, Yongchao Chen, Mingyu Liu, Jiange Yang, Haoyi Zhu, Yating Wang, Tong He, Yilun Chen, Xili Dai, Nanyang Ye, Qinying Gu

TL;DR

PEWM introduces Primitive Embodied World Models to address data sparsity and the difficulty of long-horizon video prediction in embodied AI. By decomposing tasks into short, language-grounded primitives and leveraging a Vision-Language Model planner with Start-Goal heatmap guidance, the approach achieves fine-grained language–action alignment, data efficiency, and real-time control. The method combines a sim-real hybrid data strategy, a three-stage video-model fine-tuning pipeline, and causal distillation to produce real-time, controllable primitive rollouts, enabling zero-shot compositional generalization and scalable data synthesis for sim-to-real transfer. Empirical results on RLBench and real robot experiments show strong primitive-level generalization, robust long-horizon planning, and competitive video-generation quality with a small model, highlighting the practical potential for scalable embodied intelligence.

Abstract

While video-generation-based embodied world models have gained increasing attention, their reliance on large-scale embodied interaction data remains a key bottleneck. The scarcity, difficulty of collection, and high dimensionality of embodied data fundamentally limit the alignment granularity between language and actions and exacerbate the challenge of long-horizon video generation, hindering generative models from achieving a "GPT moment" in the embodied domain. We start from a simple observation: the diversity of embodied data far exceeds the relatively small space of possible primitive motions. Based on this insight, we propose a novel paradigm for world modeling, Primitive Embodied World Models (PEWM). By restricting video generation to fixed short horizons, our approach 1) enables fine-grained alignment between linguistic concepts and visual representations of robotic actions, 2) reduces learning complexity, 3) improves data efficiency in embodied data collection, and 4) decreases inference latency. Equipped with a modular Vision-Language Model (VLM) planner and a Start-Goal heatmap Guidance mechanism (SGG), PEWM further enables flexible closed-loop control and supports compositional generalization of primitive-level policies over extended, complex tasks. Our framework leverages the spatiotemporal vision priors in video models and the semantic awareness of VLMs to bridge the gap between fine-grained physical interaction and high-level reasoning, paving the way toward scalable, interpretable, and general-purpose embodied intelligence.

Paper Structure

This paper contains 50 sections, 2 theorems, 8 equations, 14 figures, and 12 tables.

Key Result

Corollary 2.3

The number of distinct primitive templates (e.g., "pick", "open", "arrange") is vastly smaller than the total number of possible embodied trajectories, due to the combinatorial explosion of object, scene, and embodiment configurations.
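
As a rough illustration of the counting argument behind this corollary, the sketch below compares the size of a primitive template vocabulary with the number of configurations produced by crossing those templates with objects, scenes, and embodiments. The vocabulary names and sizes are hypothetical placeholders chosen for the example, not figures from the paper.

```python
from itertools import product

# Hypothetical vocabularies: the sizes are illustrative assumptions,
# not values reported in the paper.
primitive_templates = ["pick", "open", "arrange"]
objects = [f"object_{i}" for i in range(50)]
scenes = [f"scene_{i}" for i in range(20)]
embodiments = [f"robot_{i}" for i in range(5)]


def count_configurations(templates, objects, scenes, embodiments):
    """Count every (template, object, scene, embodiment) combination."""
    return sum(1 for _ in product(templates, objects, scenes, embodiments))


num_templates = len(primitive_templates)
num_configurations = count_configurations(
    primitive_templates, objects, scenes, embodiments
)

print(f"templates: {num_templates}, configurations: {num_configurations}")
```

Even with these small made-up vocabularies, three templates expand into thousands of single-primitive configurations, and trajectories that chain several primitives would multiply the count further.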

Figures (14)

  • Figure 1: While densely distributed data can enable broad generalization, embodied data often suffer from sparsity [du2024compositional; xue2025demogen]. The rightmost schematic highlights how organizing embodied data at the primitive level--along orthogonal dimensions such as action and object--supports compositional generalization even under limited data availability.
  • Figure 2: Illustration of primitive-level task execution for "Pick up the yellow tape measure." 6-DoF motions are meant to be rolled out via diffusion, while discrete gripper actions are handled directly through symbolic execution. Note that this is a simple task, chosen for ease of illustration.
  • Figure 3: An analogy to highlight the compositional generalization capability of our approach.
  • Figure 4: Direct 6-DoF end-effector trajectory extraction from generated videos.
  • Figure 5: Closed-loop, autoregressive planning via iterative rollouts. The model feeds generated frames back as input, enabling continuous adaptation and long-horizon control.
  • ...and 9 more figures
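
The closed-loop, autoregressive rollout described in the Figure 5 caption can be sketched roughly as follows. This is a minimal toy under stated assumptions, not the authors' implementation: `closed_loop_rollout`, `toy_planner`, and `toy_world_model` are hypothetical names, a frame is reduced to a one-element list, and the primitive instructions are invented. A real system would replace the planner with a VLM and the world model with the primitive video generator.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Frame = List[float]  # stand-in for an image; a real system would use an array


@dataclass
class RolloutStep:
    primitive: str        # language instruction for this primitive
    frames: List[Frame]   # short-horizon frames generated for it


def closed_loop_rollout(
    initial_frame: Frame,
    plan_next: Callable[[Frame, int], Optional[str]],
    generate: Callable[[Frame, str], List[Frame]],
    max_steps: int = 10,
) -> List[RolloutStep]:
    """Alternate planning and short-horizon generation until the planner stops."""
    history: List[RolloutStep] = []
    current = initial_frame
    for step in range(max_steps):
        primitive = plan_next(current, step)
        if primitive is None:  # planner signals that the task is finished
            break
        frames = generate(current, primitive)
        history.append(RolloutStep(primitive, frames))
        current = frames[-1]  # feed the last generated frame back as input
    return history


def toy_planner(frame: Frame, step: int) -> Optional[str]:
    """Hypothetical fixed script standing in for a VLM planner."""
    script = [
        "move above the tape measure",
        "lower to the tape measure",
        "lift the tape measure",
    ]
    return script[step] if step < len(script) else None


def toy_world_model(frame: Frame, primitive: str, horizon: int = 4) -> List[Frame]:
    """Hypothetical generator: each new frame adds 1 to the single 'pixel'."""
    return [[frame[0] + t + 1] for t in range(horizon)]


steps = closed_loop_rollout([0.0], toy_planner, toy_world_model)
for s in steps:
    print(s.primitive, "->", s.frames[-1])
```

Because each call to the generator starts from the last frame of the previous primitive, the long-horizon task is covered by a chain of fixed short rollouts rather than one long video. The toy routes every primitive through the generator for simplicity, whereas the Figure 2 caption indicates that discrete gripper actions are executed symbolically.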

Theorems & Definitions (4)

  • Definition 2.1: Primitive as Semantically Atomic Action Unit
  • Corollary 2.3: Compact Primitive Template Basis
  • Theorem A.4: Density of Primitive Compositions
  • proof