RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements

Guangcong Zheng, Teng Li, Xianpan Zhou, Xi Li

TL;DR

RealCam-Vid addresses the lack of dynamic-scene data with metric-scale geometry for camera-controllable video synthesis by releasing a high-resolution, open-source dataset built from multiple sources and refined through a multi-stage processing pipeline. The pipeline comprises clip splitting with a learned Koala-36M-based method, motion-intensity filtering with CoTracker, long-form captioning with CogVLM2-Caption, MonST3R-based camera annotation for dynamic scenes, and metric-scale alignment, which converts depth to disparity and solves for a scale factor $s^* = \frac{\sum_i D^{abs}_i D^{rel}_i}{\sum_i (D^{rel}_i)^2}$, where $D^{abs}_i$ and $D^{rel}_i$ are the metric-scale and relative-scale disparities. Together these stages yield temporally coherent sequences with metric-grounded camera trajectories, so that models can learn both scene dynamics and accurate camera motion. The work thus provides a practical dataset and an end-to-end annotation strategy for metric-aware training of camera-controllable video generation systems. By combining cross-dataset metric consistency with dynamic content, RealCam-Vid has the potential to improve generalization to real-world physics in video synthesis applications.
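
One way to read the scale factor above is as the closed-form solution of an ordinary least-squares fit of the relative-scale disparities to the metric-scale ones. The objective below is an assumption made for illustration (this summary does not state it); under that assumption, setting the derivative with respect to $s$ to zero recovers the same expression:

$$
s^* = \arg\min_s \sum_i \left(D^{abs}_i - s\,D^{rel}_i\right)^2,
\qquad
-2\sum_i D^{rel}_i\left(D^{abs}_i - s\,D^{rel}_i\right) = 0
\;\Longrightarrow\;
s^* = \frac{\sum_i D^{abs}_i D^{rel}_i}{\sum_i (D^{rel}_i)^2}.
$$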

Abstract

Recent advances in camera-controllable video generation have been constrained by the reliance on static-scene datasets with relative-scale camera annotations, such as RealEstate10K. While these datasets enable basic viewpoint control, they fail to capture dynamic scene interactions and lack metric-scale geometric consistency, both of which are critical for synthesizing realistic object motions and precise camera trajectories in complex environments. To bridge this gap, we introduce the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations, available at https://github.com/ZGCTroy/RealCam-Vid.

Paper Structure

This paper contains 6 sections, 2 equations, 3 figures.

Figures (3)

  • Figure 1: Overview of Existing Datasets for Camera Motions and Scene Dynamics. Static Scene & Dynamic Camera videos boast high aesthetic quality and dense relative-scale camera trajectory annotations but lack object dynamics, which may lead to overfitting on rigid structures. Dynamic Scene & Static Camera videos capture dynamic objects yet omit camera motion, limiting their applicability to trajectory-based video generation. Dynamic Scene & Dynamic Camera videos feature rich real-world dynamics with both moving objects and camera motion but lack metric-scale camera annotations, rendering them unsuitable for metric-scale training. In this technical report, we release the first open-source high-resolution video dataset with dynamic scenes and metric-scale camera parameters at https://github.com/ZGCTroy/RealCam-Vid.
  • Figure 2: Our Data Filtering Pipeline. We employ a series of filters to refine the dataset, starting from three distinct sources: RealEstate10K [zhou2018stereo], MiraData [ju2024miradata], and DL3DV-10K [ling2024dl3dv]. The data pass through successive stages, with key filters including Video Length, Motion Intensity, and Outlier & Confidence. The final dataset is then curated with the VTSS Score Filter from Koala-36M [wang2024koala]. Gray bars show the amount of data removed by each filter, while colored bars show the data remaining at each stage.
  • Figure 3: Pipeline for Metric Scale Alignment. This diagram illustrates how heterogeneous video sources are calibrated for cross-dataset compatibility by aligning relative-scale camera trajectories to an absolute, metric scale. Depth maps are converted to disparity maps to suppress distant noise and highlight near-field detail. Metric-scale estimates are obtained from a metric depth predictor, while relative-scale disparities come from 4D reconstructions.
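
The alignment step described for Figure 3 can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names `depth_to_disparity` and `solve_scale`, the clamp value `eps`, and the synthetic depth maps are all hypothetical, and disparity is assumed to mean inverse depth. In practice the metric depth would come from a metric depth predictor and the relative depth from the 4D reconstruction; here both are simulated.

```python
import numpy as np


def depth_to_disparity(depth, eps=1e-6):
    """Convert a depth map to disparity, assumed here to be inverse depth.

    Clamping by eps (an assumed guard value) avoids division by zero.
    """
    depth = np.asarray(depth, dtype=np.float64)
    return 1.0 / np.maximum(depth, eps)


def solve_scale(disp_abs, disp_rel):
    """Closed-form scale s* = sum(D_abs * D_rel) / sum(D_rel ** 2)."""
    d_abs = np.asarray(disp_abs, dtype=np.float64).ravel()
    d_rel = np.asarray(disp_rel, dtype=np.float64).ravel()
    denom = float(np.dot(d_rel, d_rel))
    if denom == 0.0:
        raise ValueError("relative disparity is all zeros; scale is undefined")
    return float(np.dot(d_abs, d_rel) / denom)


# Synthetic stand-ins: metric depth is 2.5x the relative depth everywhere.
rng = np.random.default_rng(0)
depth_rel = rng.uniform(1.0, 10.0, size=(4, 6))
depth_abs = 2.5 * depth_rel

s_star = solve_scale(depth_to_disparity(depth_abs), depth_to_disparity(depth_rel))
print(round(s_star, 6))  # disparities differ by a factor of 1 / 2.5
```

Because disparity is the inverse of depth under this assumption, a metric depth 2.5 times larger than the relative depth gives a recovered scale of about 0.4. Under the same convention, lengths in the relative reconstruction (including camera translations) would be divided by `s_star` to reach metric units; how the scale is actually applied to the trajectories is not specified here.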