RealCam-Vid: High-resolution Video Dataset with Dynamic Scenes and Metric-scale Camera Movements
Guangcong Zheng, Teng Li, Xianpan Zhou, Xi Li
TL;DR
RealCam-Vid addresses the lack of dynamic-scene data with metric-scale geometry for camera-controllable video synthesis by releasing a high-resolution, open-source dataset built from multiple sources and refined through a multi-stage processing pipeline. The pipeline consists of clip splitting with a learned Koala-36M-based method, motion-intensity filtering with CoTracker, long-form captioning with CogVLM2-Caption, MonST3R-based camera annotation for dynamic scenes, and metric-scale alignment, which converts depth to disparity and solves for the least-squares scale factor $s^* = \frac{\sum_i D^{abs}_i D^{rel}_i}{\sum_i (D^{rel}_i)^2}$. Together, these stages yield temporally coherent clips with metric-scale camera trajectories, so that models can learn both scene dynamics and accurate camera motion. The work thus provides a practical dataset and an end-to-end annotation strategy for metric-aware training of camera-controllable video generation systems. Because camera scale is consistent across source datasets and the content is dynamic, RealCam-Vid may help video synthesis models generalize better to real-world motion.
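The scale factor above is the closed-form minimizer of $\sum_i (D^{abs}_i - s\,D^{rel}_i)^2$. The following is a minimal NumPy sketch of that one step, not the authors' implementation. The function names `depth_to_disparity` and `solve_scale`, the optional validity mask, and the use of reciprocal depth as disparity are assumptions made for illustration.

```python
import numpy as np


def depth_to_disparity(depth, eps=1e-6):
    """Convert depth to disparity as 1 / depth (assumed convention).

    Depths are clipped to at least `eps` to avoid division by zero.
    """
    depth = np.asarray(depth, dtype=np.float64)
    return 1.0 / np.clip(depth, eps, None)


def solve_scale(d_abs, d_rel, mask=None):
    """Return s* = sum(d_abs * d_rel) / sum(d_rel ** 2).

    This is the least-squares solution of d_abs ~ s * d_rel.
    `mask` (hypothetical) selects the pixels used in the fit.
    """
    d_abs = np.asarray(d_abs, dtype=np.float64).ravel()
    d_rel = np.asarray(d_rel, dtype=np.float64).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        d_abs, d_rel = d_abs[keep], d_rel[keep]
    denom = np.dot(d_rel, d_rel)
    if denom == 0.0:
        raise ValueError("relative values are all zero; scale is undefined")
    return float(np.dot(d_abs, d_rel) / denom)


# Toy example: the metric depth is exactly twice the relative depth,
# so the metric disparity is half the relative disparity.
relative_depth = np.array([1.0, 2.0, 4.0])
metric_depth = 2.0 * relative_depth

scale = solve_scale(
    depth_to_disparity(metric_depth),
    depth_to_disparity(relative_depth),
)
print(scale)
```

Note that a scale fitted in disparity space is the reciprocal of the corresponding depth scale: in the toy example the depths differ by a factor of 2, while the fitted disparity scale is 0.5.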
Abstract
Recent advances in camera-controllable video generation have been constrained by the reliance on static-scene datasets with relative-scale camera annotations, such as RealEstate10K. While these datasets enable basic viewpoint control, they fail to capture dynamic scene interactions and lack metric-scale geometric consistency, both of which are critical for synthesizing realistic object motions and precise camera trajectories in complex environments. To bridge this gap, we introduce the first fully open-source, high-resolution dynamic-scene dataset with metric-scale camera annotations, available at https://github.com/ZGCTroy/RealCam-Vid.
