OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, Mingyu Liu, Dingning Liu, Jiange Yang, Zhoujie Fu, Junyi Chen, Chunhua Shen, Jiangmiao Pang, Kaipeng Zhang, Tong He

TL;DR

OmniWorld tackles the data bottleneck in 4D world modeling by introducing a large-scale, multi-domain, multi-modal dataset that combines a self-collected OmniWorld-Game subset with publicly available sources. It establishes a two-task benchmark covering 3D geometric prediction and camera-controlled video generation, and demonstrates that fine-tuning state-of-the-art models on OmniWorld yields substantial gains in both geometric reconstruction and dynamic video synthesis. The work emphasizes rich modalities (depth, poses, optical flow, captions) and long, diverse sequences for robust training and evaluation of 4D models. By providing comprehensive annotations and benchmarking tools, OmniWorld aims to accelerate the development of general-purpose, robust 4D world models that better understand and interact with the physical world.

Abstract

The field of 4D world modeling - aiming to jointly capture spatial geometry and temporal dynamics - has witnessed remarkable progress in recent years, driven by advances in large-scale generative models and multimodal learning. However, the development of truly general 4D world models remains fundamentally constrained by the availability of high-quality data. Existing datasets and benchmarks often lack the dynamic complexity, multi-domain diversity, and spatial-temporal annotations required to support key tasks such as 4D geometric reconstruction, future prediction, and camera-control video generation. To address this gap, we introduce OmniWorld, a large-scale, multi-domain, multi-modal dataset specifically designed for 4D world modeling. OmniWorld consists of a newly collected OmniWorld-Game dataset and several curated public datasets spanning diverse domains. Compared with existing synthetic datasets, OmniWorld-Game provides richer modality coverage, larger scale, and more realistic dynamic interactions. Based on this dataset, we establish a challenging benchmark that exposes the limitations of current state-of-the-art (SOTA) approaches in modeling complex 4D environments. Moreover, fine-tuning existing SOTA methods on OmniWorld leads to significant performance gains across 4D reconstruction and video generation tasks, strongly validating OmniWorld as a powerful resource for training and evaluation. We envision OmniWorld as a catalyst for accelerating the development of general-purpose 4D world models, ultimately advancing machines' holistic understanding of the physical world.

Paper Structure

This paper contains 27 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: We introduce OmniWorld, a large-scale, multi-domain, and multi-modal dataset. OmniWorld provides a rich resource for 4D world modeling by integrating high-quality data from multiple domains and offers a variety of data types, including depth maps, camera poses, text captions, optical flow and foreground masks. OmniWorld is designed to accelerate the development of more general models for modeling the real physical world.
  • Figure 2: OmniWorld acquisition and annotation pipeline. We collect raw data from diverse domains and apply a video slicing filter to obtain high-quality RGB sequences. These sequences are then processed through a suite of specialized pipelines to generate multi-modal annotations, including text captions, depth maps, camera poses, foreground masks, and optical flow.
  • Figure 3: Statistical information of OmniWorld. (a) displays the compositional distribution of data from different domains within OmniWorld, (b) presents the internal composition of OmniWorld-Game, and (c) shows the distribution of caption tokens in OmniWorld.
  • Figure 4: Qualitative comparison of monocular depth estimation on the OmniWorld-Game benchmark.
  • Figure 5: Qualitative comparison of multi-view 3D reconstruction on the OmniWorld-Game benchmark.
  • ...and 5 more figures