TrackingWorld: World-centric Monocular 3D Tracking of Almost All Pixels

Jiahao Lu, Weitao Xiong, Jiacheng Deng, Peng Li, Tianyu Huang, Zhiyang Dou, Cheng Lin, Sai-Kit Yeung, Yuan Liu

TL;DR

TrackingWorld presents a two-stage approach for dense, world-centric monocular 3D tracking: first densifying 2D tracks with a track upsampler to cover nearly all pixels, then optimizing camera poses and lifting tracks into a shared world-centric 3D frame. It explicitly disentangles camera motion from object motion through an as-static-as-possible constraint and dynamic-object-aware optimization, enabling tracking of newly emerging dynamic regions across all frames. The method demonstrates improved camera pose accuracy, depth consistency, and geometric fidelity for both dense and sparse tracking, with favorable runtime due to clip-to-global parallel optimization and selective downsampling. Overall, TrackingWorld delivers dense, temporally coherent 3D trajectories in a world-centric frame, useful for motion analysis, video editing, and 3D scene understanding in monocular videos.

Abstract

Monocular 3D tracking aims to capture the long-term motion of pixels in 3D space from a single monocular video and has witnessed rapid progress in recent years. However, we argue that the existing monocular 3D tracking methods still fall short in separating the camera motion from foreground dynamic motion and cannot densely track newly emerging dynamic subjects in the videos. To address these two limitations, we propose TrackingWorld, a novel pipeline for dense 3D tracking of almost all pixels within a world-centric 3D coordinate system. First, we introduce a tracking upsampler that efficiently lifts the arbitrary sparse 2D tracks into dense 2D tracks. Then, to generalize the current tracking methods to newly emerging objects, we apply the upsampler to all frames and reduce the redundancy of 2D tracks by eliminating the tracks in overlapped regions. Finally, we present an efficient optimization-based framework to back-project dense 2D tracks into world-centric 3D trajectories by estimating the camera poses and the 3D coordinates of these 2D tracks. Extensive evaluations on both synthetic and real-world datasets demonstrate that our system achieves accurate and dense 3D tracking in a world-centric coordinate frame.
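The final step of the abstract, back-projecting 2D tracks into world-centric 3D trajectories, can be illustrated with a minimal sketch. It assumes a standard pinhole camera model, known per-frame depths, and camera-to-world poses given as a rotation matrix and a translation. The function names (`backproject`, `to_world`, `lift_track`) and all numeric values are hypothetical and chosen for illustration; the paper estimates poses and 3D coordinates through optimization, which this sketch does not attempt to reproduce.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) at a given depth into camera coordinates."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def to_world(p_cam: Vec3, R: Mat3, t: Sequence[float]) -> Vec3:
    """Apply an assumed camera-to-world pose: X_world = R * X_cam + t."""
    return tuple(sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))


def lift_track(track_2d, depths, poses, intrinsics) -> List[Vec3]:
    """Lift one 2D track (one pixel per frame) into world coordinates."""
    fx, fy, cx, cy = intrinsics
    return [to_world(backproject(u, v, d, fx, fy, cx, cy), R, t)
            for (u, v), d, (R, t) in zip(track_2d, depths, poses)]


# Toy example (all values hypothetical): a static point seen from two camera positions.
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
intrinsics = (100.0, 100.0, 50.0, 50.0)      # fx, fy, cx, cy
track_2d = [(50.0, 50.0), (0.0, 50.0)]       # the pixel shifts as the camera moves
depths = [2.0, 2.0]
poses = [(IDENTITY, (0.0, 0.0, 0.0)),        # frame 0: camera at the world origin
         (IDENTITY, (1.0, 0.0, 0.0))]        # frame 1: camera moved +1 along x
world_track = lift_track(track_2d, depths, poses, intrinsics)
```

In this toy case both observations map to the same world point, (0, 0, 2), even though the pixel position changes between frames. This is the behavior one would expect for a static point once camera motion is factored out, whereas a dynamic point would trace a path in world coordinates.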

Paper Structure

This paper contains 50 sections, 20 equations, 10 figures, 9 tables.

Figures (10)

  • Figure 1: TrackingWorld estimates world-centric dense tracking results from monocular videos. Our model can accurately estimate camera poses and achieve disentangled 3D track modeling of static and dynamic components, not just limited to one foreground dynamic object. We only visualize a subset of foreground dynamic point trajectories and apply a fading color to background static points.
  • Figure 2: Overview. Given a video sequence, TrackingWorld first generates dense 2D tracking results that are capable of capturing newly emerging objects in the scene. These 2D trajectories are then fed into an optimization-based framework to transform them into a world-centric 3D space. Specifically, we begin by estimating the initial camera poses for each frame at the clip level. We then perform dynamic background refinement to exclude potentially dynamic regions and refine the camera poses. Based on the optimized poses, we finally reconstruct the trajectories of all dynamic regions.
  • Figure 3: Sparse 3D tracking results. "Feed." denotes feedforward methods, while "Optim" denotes optimization-based methods.
  • Figure 4: Long-range optical flow results.
  • Figure 5: Qualitative results on the DAVIS dataset. Our method outputs both reliable camera trajectories and world-centric dense tracking. The second row visualizes 3D tracking results on temporally spaced keyframes, while the third row shows complete tracks across continuous frames.
  • ...and 5 more figures
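
The dense 2D stage in the Figure 2 overview has to pick up newly emerging objects without duplicating tracks that already cover a region. One simple way to picture this is a per-frame occupancy test: new tracks are started only at pixels that no existing track currently covers. The sketch below is an illustration under that assumption, not the paper's actual redundancy-reduction procedure. The helper names (`occupancy_mask`, `keep_uncovered`) and the toy frame size are hypothetical.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int]  # (x, y) integer pixel coordinates


def occupancy_mask(points: Iterable[Tuple[float, float]],
                   height: int, width: int) -> List[List[bool]]:
    """Mark pixels that existing tracks already cover in the current frame."""
    mask = [[False] * width for _ in range(height)]
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:  # ignore tracks that left the frame
            mask[row][col] = True
    return mask


def keep_uncovered(candidates: Iterable[Pixel], mask: List[List[bool]]) -> List[Pixel]:
    """Keep only candidate start pixels that no existing track covers."""
    return [(x, y) for (x, y) in candidates if not mask[y][x]]


# Toy example (hypothetical values): a 2x3 frame with three existing tracks,
# one of which has moved outside the image.
height, width = 2, 3
existing = [(0.2, 0.1), (1.6, 0.0), (5.0, 5.0)]
mask = occupancy_mask(existing, height, width)
candidates = [(x, y) for y in range(height) for x in range(width)]
new_seeds = keep_uncovered(candidates, mask)
```

Here pixels (0, 0) and (2, 0) are already covered, so new tracks would start only at the four remaining pixels. Applying such a test at every frame keeps the track set dense while avoiding redundant tracks in overlapped regions.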