SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

Yunnan Wang, Kecheng Zheng, Jianyuan Wang, Minghao Chen, David Novotny, Christian Rupprecht, Yinghao Xu, Xing Zhu, Wenjun Zeng, Xin Jin, Yujun Shen

Abstract

The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this chasm, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.

Paper Structure

This paper contains 11 sections, 2 equations, 8 figures, 5 tables, and 1 algorithm.

Figures (8)

  • Figure 1: SceneScribe-1M offers more than one million dynamic scenes spanning over 4,000 hours, featuring comprehensive semantic and geometric annotations (i.e., detailed descriptions, motion masks, camera poses, continuous video depths, and dynamic tracks). It supports diverse downstream tasks (i.e., monocular depth estimation, scene reconstruction, dynamic point tracking, and pose/text-to-video generation).
  • Figure 2: The curation pipeline for SceneScribe-1M consists of three stages: (a) we begin by collecting large-scale videos from various sources; (b) raw videos undergo specification and content inspection, with temporal segmentation models employed to ensure continuity; and (c) we integrate Qwen2.5-VL-72B bai2025qwen2, MegaSaM li2025megasam, and TAPIP3D zhang2025tapip3d to perform comprehensive geometric and semantic annotation.
  • Figure 3: Statistics of raw video specifications after filtering, including resolution, frames per second (FPS), and duration.
  • Figure 4: Statistics of raw video content after filtering. These charts demonstrate that the raw videos exhibit sufficient diversity of motion while eliminating lighting interference.
  • Figure 5: Caption statistics: (a) the average caption length is adequate to capture the details within each scene, and (b) key words (e.g., atmosphere, subject, and take place) effectively cover aspects such as the scene context, primary objects, and actions.
  • ...and 3 more figures