DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

Kairun Wen, Yuzhi Huang, Runyu Chen, Hui Zheng, Yunlong Lin, Panwang Pan, Chenxin Li, Wenyan Cong, Jian Zhang, Junbin Lu, Chenguo Lin, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Yue Huang, Xinghao Ding, Rakesh Ranjan, Zhiwen Fan

TL;DR

DynamicVerse addresses the scarcity of scalable, real-world 4D data by introducing DynamicGen, an automated data curation pipeline that extracts metric-scale geometry, moving objects, and hierarchical captions from monocular videos. The framework integrates foundation models for geometry initialization, dynamic segmentation, and caption generation, and employs a multi-stage dynamic bundle adjustment to produce coherent 4D representations. The resulting DynamicVerse dataset contains over 100K dynamic scenes with 800K masks and 10M frames, enabling improved video depth, camera pose, and intrinsics estimation, as well as richer language-grounded scene descriptions. The work highlights practical impacts for 4D vision-language modeling, dynamic content generation, and embodied AI, while acknowledging the noise, computational costs, and safety considerations inherent to web-video data.

Abstract

Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

Paper Structure

This paper contains 42 sections, 12 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The overview of physically-aware multi-modal world modeling framework DynamicVerse.
  • Figure 2: The statistics and data source of DynamicVerse.
  • Figure 3: The physically-aware multi-modal 4D data generation pipeline DynamicGen.
  • Figure 4: Qualitative results of moving object segmentation. We compare some of our segmentation results on the YouTube-VIS dataset with those of other methods.
  • Figure 5: Visual comparisons of 4D reconstruction on in-the-wild data.
  • ...and 4 more figures