The Dynamic Prior: Understanding 3D Structures for Casual Dynamic Videos

Zhuoyuan Wu, Xurui Yang, Jiahui Huang, Yue Wang, Jun Gao

TL;DR

This work tackles robust 3D structure understanding from casual dynamic videos by introducing Dynapo, a semantics-guided dynamic prior that identifies moving objects without task-specific training. Dynapo combines Vision-Language Models for dynamic object reasoning with SAM2-based segmentation to generate accurate, instance-aware dynamic masks, which are then integrated into camera pose optimization, depth estimation, and 4D trajectory recovery. The approach yields state-of-the-art motion segmentation and substantial improvements in downstream 3D tasks across synthetic and real datasets, highlighting the practicality and generalizability of a reasoning-driven dynamic prior for real-world scenes.

Abstract

Estimating accurate camera poses, 3D scene geometry, and object motion from in-the-wild videos is a long-standing challenge for classical structure from motion pipelines due to the presence of dynamic objects. Recent learning-based methods attempt to overcome this challenge by training motion estimators to filter dynamic objects and focus on the static background. However, their performance is largely limited by the availability of large-scale motion segmentation datasets, resulting in inaccurate segmentation and, therefore, inferior structural 3D understanding. In this work, we introduce the Dynamic Prior (Dynapo) to robustly identify dynamic objects without task-specific training, leveraging the powerful reasoning capabilities of Vision-Language Models (VLMs) and the fine-grained spatial segmentation capacity of SAM2. Dynapo can be seamlessly integrated into state-of-the-art pipelines for camera pose optimization, depth reconstruction, and 4D trajectory estimation. Extensive experiments on both synthetic and real-world videos demonstrate that Dynapo not only achieves state-of-the-art performance on motion segmentation, but also significantly improves accuracy and robustness for structural 3D understanding.

Paper Structure

This paper contains 28 sections, 19 equations, 22 figures, 7 tables.

Figures (22)

  • Figure 1: Current works on estimating 3D structures from dynamic videos typically rely on segmenting out dynamic objects for robust bundle adjustment. However, the dynamic object masks from these works are typically inaccurate, as shown in Fig (a). Our method can reason about the dynamic scene and generate precise masks for dynamic objects. These masks can be seamlessly integrated into camera pose optimization (Fig (b)), depth optimization (Fig (c)), and 4D track optimization (Fig (d)) for robust structural 3D understanding.
  • Figure 2: Overview of Dynapo. Given a video sequence, the Dynamic Object Reasoning module takes sub-sampled keyframes as input, reasons about all the dynamic objects within the video, and generates a description $s^i$ and a frame number $f^i$ for each dynamic object ($k$ objects in total). The Dynamic Object Segmentation module then generates a mask sequence for each dynamic object, and these sequences are averaged to produce the final dynamic mask.
  • Figure 3: Illustration of our bundle adjustment (BA). When calculating the reprojection loss, we remove the dynamic objects (pink tracks) from the loss function using the mask from Dynapo, so that BA optimizes over the static background only.
  • Figure 4: Visualization of the uncertainty map. MegaSaM produces inaccurate dynamic masks, which degrades depth optimization, while our Dynapo generates cleaner and more plausible dynamic masks.
  • Figure 5: Visualization of the motion mask. Motion scores from Stereo4D [jin2025stereo4d] are inaccurate: some dynamic tracks receive extremely small motion scores, leading to inappropriate loss weighting during optimization.
  • ...and 17 more figures
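
The masking idea in the Figure 3 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`project`, `masked_reprojection_loss`), the pinhole camera model, the nearest-pixel mask lookup, and the mean squared residual are all assumptions chosen for illustration. It only shows how a binary dynamic mask could exclude tracks on moving objects from a reprojection loss.

```python
import numpy as np

def project(points_world, R, t, K):
    """Pinhole projection of (N, 3) world points to (N, 2) pixel coordinates."""
    cam = points_world @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                        # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]         # perspective divide

def masked_reprojection_loss(points_world, observed_uv, dynamic_mask, R, t, K):
    """Mean squared reprojection error over tracks that fall on static pixels.

    dynamic_mask is an (H, W) boolean array, True where a pixel belongs to a
    moving object (hypothetical input standing in for a Dynapo mask).
    """
    pred = project(points_world, R, t, K)
    h, w = dynamic_mask.shape
    # Nearest-pixel lookup of the mask at each observed track location.
    cols = np.clip(np.rint(observed_uv[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(observed_uv[:, 1]).astype(int), 0, h - 1)
    static = ~dynamic_mask[rows, cols].astype(bool)
    if not static.any():
        return 0.0
    residuals = np.sum((pred - observed_uv) ** 2, axis=1)
    return float(residuals[static].mean())

# Toy setup: identity pose, two 3D points, the second one lies on a moving object.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
points = np.array([[0.0, 0.0, 2.0], [0.1, 0.0, 2.0]])
observed = np.array([[32.0, 32.0], [50.0, 32.0]])   # second track has drifted

mask = np.zeros((64, 64), dtype=bool)
mask[32, 50] = True                                  # mark the moving pixel

loss_masked = masked_reprojection_loss(points, observed, mask, R, t, K)
loss_unmasked = masked_reprojection_loss(points, observed, np.zeros_like(mask), R, t, K)
```

In this toy case the drifting track dominates the unmasked loss, while the masked loss only sees the static point. A real pipeline would minimize such a loss over poses and points with a robust optimizer instead of evaluating it once.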