SpatialTracker: Tracking Any 2D Pixels in 3D Space

Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, Xiaowei Zhou

TL;DR

SpatialTracker addresses the challenge of dense, long-range pixel tracking by lifting 2D pixels into 3D using monocular depth and encoding the scene with a compact triplane representation. It predicts long-range 3D trajectories via an iterative transformer framework and enforces motion priors through an as-rigid-as-possible constraint with a learnable rigidity embedding to reveal rigid parts. The method achieves state-of-the-art results on 2D benchmarks (TAP-Vid, BADJA, PointOdyssey) and provides strong 3D tracking performance with RGBD input, demonstrating the benefits of 3D reasoning for video motion understanding. The work highlights the potential of integrating monocular depth priors with 3D trajectory modeling to improve robustness to occlusions and out-of-plane motion, with future gains expected from advances in depth estimation.
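
The lifting step mentioned above can be illustrated with a standard pinhole unprojection. This is a minimal sketch, not the authors' code: it assumes known camera intrinsics (`fx`, `fy`, `cx`, `cy`) and a depth value measured along the optical axis, and the function names `unproject_pixel` and `project_point` are hypothetical.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with a depth estimate to a camera-space 3D point.

    Assumes a pinhole camera; `depth` is the z-coordinate of the point.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def project_point(point: Point3D,
                  fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a camera-space 3D point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)
```

Projecting the lifted point with the same intrinsics recovers the original pixel, so a 3D trajectory can always be mapped back to a 2D track for evaluation on 2D benchmarks.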

Abstract

Recovering dense and long-range pixel motion in videos is a challenging problem. Part of the difficulty arises from the 3D-to-2D projection process, leading to occlusions and discontinuities in the 2D motion domain. While 2D motion can be intricate, we posit that the underlying 3D motion can often be simple and low-dimensional. In this work, we propose to estimate point trajectories in 3D space to mitigate the issues caused by image projection. Our method, named SpatialTracker, lifts 2D pixels to 3D using monocular depth estimators, represents the 3D content of each frame efficiently using a triplane representation, and performs iterative updates using a transformer to estimate 3D trajectories. Tracking in 3D allows us to leverage as-rigid-as-possible (ARAP) constraints while simultaneously learning a rigidity embedding that clusters pixels into different rigid parts. Extensive evaluation shows that our approach achieves state-of-the-art tracking performance both qualitatively and quantitatively, particularly in challenging scenarios such as out-of-plane rotation.
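
To make the triplane idea concrete, the sketch below queries a feature for a 3D point by projecting it onto three axis-aligned planes, sampling each bilinearly, and combining the results. Several details are assumptions made for illustration: the planes are single-channel grids, the point is already expressed in grid coordinates, the three samples are fused by summation, and the names `bilinear` and `query_triplane` are hypothetical.

```python
import math
from typing import List, Tuple

Plane = List[List[float]]  # single-channel feature map, indexed plane[row][col]


def bilinear(plane: Plane, col: float, row: float) -> float:
    """Bilinearly interpolate `plane` at a continuous (col, row), clamped to the grid."""
    h, w = len(plane), len(plane[0])
    col = min(max(col, 0.0), w - 1.0)
    row = min(max(row, 0.0), h - 1.0)
    c0, r0 = int(math.floor(col)), int(math.floor(row))
    c1, r1 = min(c0 + 1, w - 1), min(r0 + 1, h - 1)
    ac, ar = col - c0, row - r0
    top = plane[r0][c0] * (1 - ac) + plane[r0][c1] * ac
    bottom = plane[r1][c0] * (1 - ac) + plane[r1][c1] * ac
    return top * (1 - ar) + bottom * ar


def query_triplane(xy: Plane, xz: Plane, yz: Plane,
                   point: Tuple[float, float, float]) -> float:
    """Sample the three planes at the point's projections and sum the features.

    Assumed convention: XY uses (col=x, row=y), XZ uses (col=x, row=z),
    YZ uses (col=y, row=z).
    """
    x, y, z = point
    return bilinear(xy, x, y) + bilinear(xz, x, z) + bilinear(yz, y, z)
```

Three 2D grids of side N store O(N^2) values instead of the O(N^3) a dense voxel grid would need, which is the sense in which such a representation is compact.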

Paper Structure (24 sections, 7 equations, 3 figures, 5 tables)

This paper contains 24 sections, 7 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of Our Pipeline. We first encode each frame into a triplane representation (a) using a triplane encoder (b). We then initialize and iteratively update point trajectories in 3D space using a transformer that takes features extracted from these triplanes as input (c). The 3D trajectories are trained with ground-truth annotations and are regularized by an as-rigid-as-possible (ARAP) constraint with a learned rigidity embedding (d). The ARAP constraint enforces that 3D distances between points with similar rigidity embeddings remain constant over time. Here $d_{ij}$ represents the distance between points $i$ and $j$, while $s_{ij}$ denotes their rigidity similarity. Our method produces accurate long-range motion tracks even under fast movements and severe occlusion (e).
  • Figure 2: Qualitative Comparison. For each sequence, we show tracking results of CoTracker and our method, SpatialTracker.
  • Figure 3: Rigid Part Segmentation. We utilize spectral clustering on the rigidity embedding to determine rigid groups. Each color represents a distinct rigid group.
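
The ARAP constraint described in the Figure 1 caption can be sketched as a similarity-weighted penalty on changes in pairwise 3D distance. This is an illustrative sketch under stated assumptions, not the paper's exact loss: it compares each frame against the first frame, averages over pairs and frames, and takes the similarities $s_{ij}$ as a given matrix rather than computing them from a learned embedding. The function name `arap_penalty` is hypothetical.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def arap_penalty(tracks: Sequence[Sequence[Point3D]],
                 sim: Sequence[Sequence[float]]) -> float:
    """Mean of s_ij * |d_ij(t) - d_ij(0)| over point pairs i < j and frames t > 0.

    tracks[t][i] is the 3D position of point i in frame t.
    sim[i][j] is the rigidity similarity s_ij, assumed to lie in [0, 1].
    """
    num_frames, num_points = len(tracks), len(tracks[0])
    total, count = 0.0, 0
    for t in range(1, num_frames):
        for i in range(num_points):
            for j in range(i + 1, num_points):
                d_ref = math.dist(tracks[0][i], tracks[0][j])  # d_ij in frame 0
                d_t = math.dist(tracks[t][i], tracks[t][j])    # d_ij in frame t
                total += sim[i][j] * abs(d_t - d_ref)
                count += 1
    return total / count if count else 0.0
```

Pairs with low similarity contribute little to the penalty, so points on different rigid parts remain free to move relative to each other, while pairs with high similarity are pushed to keep their distance fixed.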