DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction

Jenny Seidenschwarz, Qunjie Zhou, Bardienus Duisterhof, Deva Ramanan, Laura Leal-Taixé

TL;DR

DynOMo tackles online point tracking from unposed monocular videos by jointly reconstructing a dynamic scene and localizing the camera using an augmented 3D Gaussian Splatting representation. By attaching semantic labels and rich visual features to each Gaussian and enforcing physics-inspired 3D regularizers, the method induces emergent 2D/3D point trajectories without correspondence-level supervision. The approach yields competitive online performance on TAPVid-DAVIS and Panoptic Sports against offline and online baselines, while progressively exploring new scene content through densification. This work establishes a strong monocular, pose-free baseline for online tracking and scene reconstruction with potential impact on mobile robotics and mixed reality, and points to further gains from improved depth and trajectory estimation in real time.

Abstract

Reconstructing scenes and tracking motion are two sides of the same coin. Tracking points allows for geometric reconstruction [14], while geometric reconstruction of (dynamic) scenes allows for 3D tracking of points over time [24, 39]. The latter was recently also exploited for 2D point tracking to overcome occlusion ambiguities by lifting tracking directly into 3D [38]. However, the above approaches require either offline processing or multi-view camera setups, both of which are unrealistic for real-world applications like robot navigation or mixed reality. We target the challenge of online 2D and 3D point tracking from unposed monocular camera input, introducing Dynamic Online Monocular Reconstruction (DynOMo). We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion. Our approach extends 3D Gaussians to capture new content and object motions while estimating camera movements from a single RGB frame. DynOMo stands out by enabling the emergence of point trajectories through robust image feature reconstruction and a novel similarity-enhanced regularization term, without requiring any correspondence-level supervision. It sets the first baseline for online point tracking with monocular unposed cameras, achieving performance on par with existing methods. We aim to inspire the community to advance online point tracking and reconstruction, expanding the applicability to diverse real-world scenarios.

Paper Structure (21 sections, 21 equations, 6 figures, 7 tables)

This paper contains 21 sections, 21 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Monocular Online Point Tracking: In this work, we present DynOMo for the task of monocular online point tracking from pose-free videos through joint 3D reconstruction and camera localization based on a dynamic 3D Gaussian representation. Please find the code and more visualizations on our project page https://jennyseidenschwarz.github.io/DynOMo.github.io.
  • Figure 2: Tracking points with online dynamic monocular reconstruction: Our pipeline takes as input a video sequence, (predicted) depth maps, sparse segmentation masks, and image features. In our online reconstruction pipeline, we optimize the camera pose $\mathcal{C}$, add a set of new Gaussians based on the densification concept of SplaTAM (Keetha et al., 2024), optimize all Gaussians together, and forward propagate $\mathcal{G}$ and $\mathcal{C}$. Finally, we directly extract 3D point trajectories from single Gaussians $G_p$ and project them to the image plane to obtain 2D trajectories.
  • Figure 3: Increasing the World Online: Our pipeline gradually adds Gaussians to the world, which allows it to explore the underlying scene as the video progresses. We visualize the growth of the world for two sequences of TAPVid-DAVIS.
  • Figure 4: Failure Cases of DynOMo: DynOMo struggles (i) to track points and add new Gaussians in sequences with extreme occlusions; (ii) to track the camera position as well as the Gaussians in sequences with extreme camera and object motion and low background texture; (iii) to track points when extreme changes in acceleration occur; (iv) to track Gaussians and camera positions when only little background is observed but extreme camera motion occurs.
  • Figure 5: Visualizations on TAPVid-DAVIS: We visualize renderings as well as point tracks on challenging scenes on TAPVid-DAVIS. DynOMo is able to generate emerging trajectories despite facing non-rigid motion, occlusions, appearance of new objects, or fast motion.
  • ...and 1 more figures
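
The last step described in the Figure 2 caption, projecting a 3D point trajectory to the image plane to obtain a 2D trajectory, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a skew-free pinhole camera with intrinsics K and one 4x4 world-to-camera matrix per frame, and the function names `project_point` and `project_trajectory` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Sequence[float]
Mat = Sequence[Sequence[float]]


def project_point(p_world: Vec3, w2c: Mat, K: Mat) -> Optional[Tuple[float, float]]:
    """Project one 3D world point to pixel coordinates.

    Assumptions (not from the paper): w2c is a 4x4 world-to-camera matrix,
    K is a 3x3 pinhole intrinsics matrix without skew.
    Returns None if the point lies behind (or on) the camera plane.
    """
    # World -> camera frame: apply the [R | t] part of the 4x4 matrix.
    x, y, z = (
        sum(w2c[r][c] * p_world[c] for c in range(3)) + w2c[r][3]
        for r in range(3)
    )
    if z <= 1e-6:
        return None
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    # Perspective division followed by the intrinsics.
    return (fx * x / z + cx, fy * y / z + cy)


def project_trajectory(
    points_world: Sequence[Vec3], w2c_per_frame: Sequence[Mat], K: Mat
) -> List[Optional[Tuple[float, float]]]:
    """Project a per-frame 3D trajectory (e.g. one Gaussian's center over
    time) using the camera pose estimated for each frame."""
    return [project_point(p, w2c, K) for p, w2c in zip(points_world, w2c_per_frame)]


if __name__ == "__main__":
    identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    K = [[100, 0, 50], [0, 100, 50], [0, 0, 1]]
    # A static point 2 units in front of a static camera.
    print(project_trajectory([(0, 0, 2), (0, 0, 2)], [identity, identity], K))
```

Frames in which the point falls behind the camera are returned as `None`, so a caller can treat them as not visible rather than plotting a meaningless pixel location.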