Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, Daniel Cremers

TL;DR

This paper tackles dynamic scene reconstruction from casual videos by decoupling camera motion from object motion using a learnable 3D tracker, enabling traditional bundle adjustment to operate on both static and dynamic points. It introduces BA-Track, a three-stage pipeline: a motion-decoupled 3D tracker, RGB-D bundle adjustment, and a global depth refinement that enforces depth consistency and rigidity. The approach yields improved camera pose accuracy (ATE) and more coherent dense reconstructions across challenging dynamic datasets, while maintaining memory efficiency. The results demonstrate that combining deep priors with classical optimization can robustly handle real-world dynamic scenes, with potential for future joint intrinsic refinement and richer depth models.

Abstract

Traditional SLAM systems, which rely on bundle adjustment, struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. Taking a novel approach, this work leverages a 3D point tracker to separate the camera-induced motion from the observed motion of dynamic objects. By considering only the camera-induced component, bundle adjustment can operate reliably on all scene elements. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM, bundle adjustment, with a robust learning-based 3D tracker front-end. Integrating motion decomposition, bundle adjustment and depth refinement, our unified framework, BA-Track, accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.
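To make the motion decomposition concrete, the following is a minimal sketch, not the paper's learned tracker. It assumes known camera poses and a pinhole model, and splits an observed 2D point motion into a camera-induced (static) part and an object-induced (dynamic) part. The function names `project` and `decouple_motion`, the pose convention `x_cam = R @ X + t`, and the intrinsics `K` are illustrative assumptions; in BA-Track the tracker estimates the decomposition so that bundle adjustment can then recover the poses.

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of world points X (N, 3), assuming x_cam = R @ X + t."""
    X_cam = X @ R.T + t
    uv = X_cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def decouple_motion(K, pose_i, pose_j, X_i, uv_j_observed):
    """Split observed 2D motion from frame i to frame j into static and dynamic parts.

    pose_i, pose_j : (R, t) tuples (assumed known here, purely for illustration)
    X_i            : (N, 3) world positions of the points at frame i
    uv_j_observed  : (N, 2) tracked pixel positions at frame j
    """
    uv_i = project(K, *pose_i, X_i)
    # Where each point would appear in frame j if it had not moved in the world.
    uv_j_static = project(K, *pose_j, X_i)
    total = uv_j_observed - uv_i      # total observed motion
    static = uv_j_static - uv_i       # camera-induced component
    dynamic = total - static          # remainder attributed to object motion
    return total, static, dynamic

# Toy example: the camera translates along x while the point moves along y.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
pose_i = (np.eye(3), np.zeros(3))
pose_j = (np.eye(3), np.array([0.1, 0.0, 0.0]))
X_i = np.array([[0.0, 0.0, 2.0]])
X_j = np.array([[0.0, 0.2, 2.0]])            # the point moved in the world
uv_j_observed = project(K, *pose_j, X_j)

total, static, dynamic = decouple_motion(K, pose_i, pose_j, X_i, uv_j_observed)
```

Only the `static` component is consistent with a rigid scene, which is why a reprojection-based bundle adjustment can use it for dynamic points as well as static ones.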

Paper Structure

This paper contains 17 sections, 16 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Framework preview. Given a casual input video, BA-Track uses a 3D tracker to separate camera-induced motion from the total observed motion, enabling bundle adjustment to process both static and dynamic points. Using the aligned sparse point tracks from bundle adjustment, we refine the dense depth maps, producing a globally consistent dynamic scene reconstruction.
  • Figure 2: Dynamic scene reconstruction results on Shibuya [qiu2022airdos], DAVIS [khoreva2019video], and Aria Everyday Activities [lv2024aria] datasets. By leveraging sparse dynamic SLAM and global refinement, BA-Track achieves consistent dense 3D reconstructions across diverse dynamic scenes.
  • Figure 3: Overview of the BA-Track framework. Given a temporal window, we compute image features $\mathbf{F}$ and depth features $\mathbf{D}$. Our 3D tracker estimates local 3D tracks, visibility, dynamic labels, and decouples the static (camera-induced) motion of each query point. Operating on the static motion components, bundle adjustment (BA) recovers the camera poses and global tracks. The final refinement stage aligns the monocular depth priors with sparse BA estimates to ensure a temporally consistent and dense reconstruction.
  • Figure 4: Illustration of motion decoupling. We decompose the total observed point motion into a static component (induced by the camera motion) and a dynamic component (induced by object motion). The static component is then used by bundle adjustment to provide camera poses and sparse reconstruction.
  • Figure 5: Qualitative camera pose estimation results on Sintel [Butler:2012:Sintel], Shibuya [qiu2022airdos], and Epic Fields [tschernezki2024epic]. Visualizations demonstrate that our method achieves more robust and accurate camera trajectories in challenging dynamic scenes.
  • ...and 5 more figures
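
The refinement stage described in Figure 3 aligns monocular depth priors with sparse bundle-adjustment estimates. As a rough illustration of that idea only, the sketch below fits a single least-squares scale per frame between the monocular depth sampled at the tracked pixels and the corresponding BA depths. This is a deliberate simplification: the paper refers to scale maps, whereas a single global scale is an assumption made here for brevity, and `fit_global_scale` and `align_depth` are hypothetical names.

```python
import numpy as np

def fit_global_scale(d_mono, d_ba):
    """Closed-form least-squares scale s minimizing sum((s * d_mono - d_ba)**2)."""
    d_mono = np.asarray(d_mono, dtype=float)
    d_ba = np.asarray(d_ba, dtype=float)
    return float(d_mono @ d_ba / (d_mono @ d_mono))

def align_depth(depth_map, pixels, d_ba):
    """Rescale a dense monocular depth map to agree with sparse BA depths.

    depth_map : (H, W) monocular depth prior
    pixels    : (N, 2) integer (row, col) locations of the sparse BA points
    d_ba      : (N,) depths of those points from bundle adjustment
    """
    rows, cols = pixels[:, 0], pixels[:, 1]
    s = fit_global_scale(depth_map[rows, cols], d_ba)
    return s * depth_map, s

# Toy example: the prior is uniformly too small by a factor of 1.5.
depth_map = np.full((4, 4), 2.0)
pixels = np.array([[0, 0], [1, 2], [3, 3]])
d_ba = np.array([3.0, 3.0, 3.0])
aligned, scale = align_depth(depth_map, pixels, d_ba)
```

A spatially varying scale would replace the scalar `s` with a per-pixel map interpolated from the sparse points; that extension is not shown here.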