GigaWorld-0: World Models as Data Engine to Empower Embodied AI

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, Zheng Zhu

TL;DR

GigaWorld-0 tackles the data bottleneck in embodied AI by unifying photorealistic video generation with geometry- and physics-consistent 3D scene synthesis into a scalable data engine. It couples GigaWorld-0-Video (Dreamer and post-hoc adapters) with GigaWorld-0-3D (foreground/background assets, differentiable physics, and action generation) under a memory- and compute-efficient training pipeline (GigaTrain with FP8 and sparse attention). The framework yields diverse, instruction-aligned data that significantly boosts real-world VLA policy performance, as shown by superior benchmark results and strong transfer to real robots without real-world interaction during training. These results position world models as practical data engines for embodied AI and set the stage for interactive, self-improving policy learning driven by synthetic data.

Abstract

World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8 precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.

Paper Structure

This paper contains 20 sections, 6 equations, 22 figures, and 4 tables.

Figures (22)

  • Figure 1: The framework of GigaWorld-0-Video-Dreamer.
  • Figure 2: Qualitative comparison of action inference on the test set. Predicted joint trajectories from GigaWorld-0-IDM closely align with ground-truth actions across all 12 arm joints and 2 gripper degrees of freedom, demonstrating high fidelity in recovering physically plausible manipulation policies from visual input alone.
  • Figure 3: The control branch of GigaWorld-0-Video.
  • Figure 4: Training data pair of GigaWorld-0-Video-ViewTransfer.
  • Figure 5: Training data pair of GigaWorld-0-Video-MimicTransfer.
  • ...and 17 more figures