Dynamic Camera Poses and Where to Find Them
Chris Rockwell, Joseph Tung, Tsung-Yi Lin, Ming-Yu Liu, David F. Fouhey, Chen-Hsuan Lin
TL;DR
This work tackles the difficulty of annotating camera poses in dynamic Internet videos by introducing DynPose-100K, a large-scale dataset with 100K videos and high-quality camera annotations. It combines a multi-stage filtering pipeline (specialist models and a vision-language model) to identify suitable footage and a dynamic pose-estimation pipeline (dynamic masking, BootsTAP tracking, and Theia-SfM) to recover accurate poses. Key contributions include the DynPose-100K dataset, a robust filtering methodology achieving high precision, a dynamic pose-estimation framework, and the Lightspeed synthetic benchmark for ground-truth pose evaluation. The dataset and benchmarks enable advancements in camera-controlled video synthesis, extended reality, and robotics in realistic, dynamic settings.
Abstract
Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.
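To make the interplay between point tracking and dynamic masking concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a tracker has already produced 2D point tracks with visibility flags and that some model has produced per-frame boolean masks of moving regions; the function name `filter_static_tracks`, the array layouts, and the `max_dynamic_frac` threshold are all hypothetical. The idea shown is only that tracks landing on dynamic regions are discarded so that structure-from-motion sees points on the static scene.

```python
import numpy as np

def filter_static_tracks(tracks, visible, dynamic_masks, max_dynamic_frac=0.0):
    """Return a boolean keep-mask over tracks (hypothetical helper).

    tracks:        (N, T, 2) float array of (x, y) pixel coordinates.
    visible:       (N, T) bool array, True where the point is visible.
    dynamic_masks: (T, H, W) bool array, True on moving (dynamic) pixels.
    A track is kept if it is visible in at least one frame and the fraction
    of its visible frames that fall on dynamic pixels is <= max_dynamic_frac.
    """
    n, t, _ = tracks.shape
    _, h, w = dynamic_masks.shape

    # Round to the nearest pixel and clamp to the image bounds.
    xs = np.clip(np.rint(tracks[..., 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(tracks[..., 1]).astype(int), 0, h - 1)

    # Look up the mask value under every track point in every frame.
    frame_idx = np.broadcast_to(np.arange(t), (n, t))
    on_dynamic = dynamic_masks[frame_idx, ys, xs] & visible

    n_visible = visible.sum(axis=1)
    frac_dynamic = on_dynamic.sum(axis=1) / np.maximum(n_visible, 1)
    return (n_visible > 0) & (frac_dynamic <= max_dynamic_frac)
```

The surviving tracks would then serve as correspondences for a structure-from-motion solver. The threshold is a design choice in this sketch: a value of 0.0 discards any track that ever touches a dynamic region, while a larger value tolerates imperfect masks.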
