Dynamic Camera Poses and Where to Find Them

Chris Rockwell, Joseph Tung, Tsung-Yi Lin, Ming-Yu Liu, David F. Fouhey, Chen-Hsuan Lin

TL;DR

This work tackles the difficulty of annotating camera poses in dynamic Internet videos by introducing DynPose-100K, a large-scale dataset with 100K videos and high-quality camera annotations. It combines a multi-stage filtering pipeline (specialist models and a vision-language model) to identify suitable footage and a dynamic pose-estimation pipeline (dynamic masking, BootsTAP tracking, and Theia-SfM) to recover accurate poses. Key contributions include the DynPose-100K dataset, a robust filtering methodology achieving high precision, a dynamic pose-estimation framework, and the Lightspeed synthetic benchmark for ground-truth pose evaluation. The dataset and benchmarks enable advancements in camera-controlled video synthesis, extended reality, and robotics in realistic, dynamic settings.

Abstract

Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.

Paper Structure

This paper contains 30 sections, 16 figures, 10 tables.

Figures (16)

  • Figure 1: We introduce DynPose-100K, a large-scale video dataset of dynamic content with camera annotations. DynPose-100K consists of 100,131 Internet videos that span diverse settings. We curate DynPose-100K such that videos contain dynamic content while ensuring the cameras are able to be estimated (including intrinsics and poses). Towards this end, we address two challenging problems: (a) identifying the videos suitable for camera estimation, and (b) improving the camera estimation algorithm for dynamic videos.
  • Figure 2: Panda-Test dataset statistics. Statistics reflect human labels on the held-out 1K-video Panda-Test set, detailed in the paper's analysis of Panda-70M filtering. Only 9% are target dynamic camera pose estimation videos due to various issues, e.g., static scenes, low-quality or non-real content, and an ambiguous or blurry frame of reference. We focus on moving cameras to facilitate downstream tasks, e.g., camera-controlled video generation and learned pose estimation. We remove unsuitable videos using a combination of specialist models and a generalist VLM.
  • Figure 3: Pose estimation approach. We apply a state-of-the-art point tracking method in a sliding window to produce dense, long-term correspondences. Complementary dynamic masks are used to remove non-static tracks. The remaining static tracks are provided as input to global bundle adjustment.
  • Figure 5: Dynamic video filtering on Panda-Test. We show PR curves for baselines and ablations. Our filtering surpasses all baselines and ablations by a considerable margin. The highlighted marker represents DynPose-100K's operating thresholds. For baselines, we show: Reconstructed points (CamCo [xu2024camco]), Reprojection error, GPT-4o mini [openai2024gpt4o]: binary (solid), GPT-4o mini [openai2024gpt4o]: score (dashed), Hands23 [cheng2023towards], and Ours. For ablations, we begin from Hands23 and add components until we recover Ours. Specifically, we depict: Hands23, +Flow, +Tracking, +Masking, +Focal, +Distort, +VLM (Ours).
  • Figure 6: Left: Targeted video length. DynPose-100K videos are primarily 4-10s, ideal for dynamic pose: shorter videos contain little ego-motion, longer videos have less dense dynamics and ego-motion. Right: Diverse dynamic apparent size. Mean size in % across video. Large dynamic objects occlude static correspondences, making pose estimation challenging. Videos may average small size in the case of only a few dynamic frames.
  • ...and 11 more figures
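
The Figure 3 caption describes discarding non-static tracks with dynamic masks before bundle adjustment. A minimal sketch of that step is given here, assuming tracks, visibility flags, and per-frame masks are already available as NumPy arrays. The function name `filter_static_tracks`, the array layouts, and the `max_dynamic_frac` threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def filter_static_tracks(tracks, visible, dynamic_masks, max_dynamic_frac=0.0):
    """Flag tracks that (almost) never land on a dynamic-mask pixel.

    tracks:        (N, T, 2) float array of (x, y) pixel positions.
    visible:       (N, T) bool array, True where the point is visible.
    dynamic_masks: (T, H, W) bool array, True on pixels labeled dynamic.
    Returns an (N,) bool array, True for tracks treated as static.
    All names and layouts here are hypothetical.
    """
    n, t, _ = tracks.shape
    _, h, w = dynamic_masks.shape

    # Round to the nearest pixel and clamp to the image bounds.
    xs = np.clip(np.rint(tracks[..., 0]).astype(int), 0, w - 1)
    ys = np.clip(np.rint(tracks[..., 1]).astype(int), 0, h - 1)

    # Look up the mask value under every track position in every frame.
    frame_idx = np.broadcast_to(np.arange(t), (n, t))
    on_dynamic = dynamic_masks[frame_idx, ys, xs] & visible

    # Fraction of visible observations that fall on dynamic pixels.
    n_visible = visible.sum(axis=1)
    dyn_frac = on_dynamic.sum(axis=1) / np.maximum(n_visible, 1)

    # Never-visible tracks carry no information, so they are dropped too.
    return (n_visible > 0) & (dyn_frac <= max_dynamic_frac)


if __name__ == "__main__":
    masks = np.zeros((3, 4, 4), dtype=bool)
    masks[:, 0:2, 0:2] = True  # top-left 2x2 block is dynamic in all frames
    tracks = np.array([
        [[0.0, 0.0], [1.0, 1.0], [0.4, 0.6]],  # stays on the dynamic block
        [[3.0, 3.0], [3.0, 2.0], [2.0, 3.0]],  # stays on the background
    ])
    visible = np.ones((2, 3), dtype=bool)
    print(filter_static_tracks(tracks, visible, masks))
```

Only the tracks flagged `True` would then be passed on to a structure-from-motion solver. Setting `max_dynamic_frac` above zero tolerates occasional mask errors at the cost of admitting some moving points.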