PointSt3R: Point Tracking through 3D Grounded Correspondence

Rhodri Guerrier, Adam W. Harley, Dima Damen

TL;DR

PointSt3R recasts point tracking as 2-frame 3D-grounded correspondence by fine-tuning MASt3R with a dynamic matching loss and a visibility head, enabling robust tracking in dynamic scenes without temporal context during inference. The method trains on synthetic, dynamically-annotated pairs to learn long-range correspondences and uses nearest-neighbor feature matching to perform both 2D and 3D tracking. Across TAP-Vid-DAVIS, RoboTAP, RGB-S, EgoPoints, and PStudio, PointSt3R achieves competitive 2D tracking performance and notable gains in 3D tracking relative to static baselines, often approaching state-of-the-art trackers. The results highlight the value of 3D-grounded representations for dynamic point tracking and show that a fine-tuned 3D reconstruction model can track dynamic points without relying on temporal context at inference.
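The nearest-neighbor matching step mentioned above can be illustrated with a minimal sketch. It assumes each frame has already been encoded into a dense grid of per-pixel descriptors; the function names, the cosine-similarity score, and the exhaustive search are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def nearest_neighbor_match(query_desc, target_feats):
    """Return ((row, col), similarity) of the best-matching target pixel.

    target_feats is an H x W grid (list of rows) of descriptor vectors.
    Hypothetical helper: exhaustive search, chosen for clarity only.
    """
    best_pos, best_sim = None, -math.inf
    for r, row in enumerate(target_feats):
        for c, desc in enumerate(row):
            sim = cosine(query_desc, desc)
            if sim > best_sim:
                best_pos, best_sim = (r, c), sim
    return best_pos, best_sim

def track_point(query_rc, query_feats, target_feats):
    """Track one query pixel from the query frame into the target frame."""
    r, c = query_rc
    return nearest_neighbor_match(query_feats[r][c], target_feats)

# Toy 2 x 2 feature grids with 2-D descriptors (made-up values).
query_feats = [[[1, 0], [0, 1]], [[1, 1], [-1, 1]]]
target_feats = [[[0, 2], [2, 0]], [[-3, 3], [1, 1]]]
position, similarity = track_point((0, 0), query_feats, target_feats)
```

Because every target frame is matched directly against the query frame, each pair is processed independently, which is consistent with the no-temporal-context setting described in this summary. For 3D tracking, one plausible reading is that the 3D point is looked up at the matched pixel in the model's predicted pointmap, but that detail is not stated here.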

Abstract

Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when focusing on static points, present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We propose to combine the reconstruction loss with training for dynamic correspondence along with a visibility head, and fine-tune MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g. competitive on TAP-Vid-DAVIS 73.8 $\delta_{\text{avg}}$ / 85.8\% occlusion acc. for PointSt3R compared to 75.7 / 88.3\% for CoTracker2; and significantly outperform CoTracker3 on EgoPoints 61.3 vs 54.2 and RGB-S 87.0 vs 82.8). We also present results on 3D point tracking along with several ablations on training datasets and percentage of dynamic correspondences.
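For readers unfamiliar with the two reported metrics, the sketch below shows how $\delta_{\text{avg}}$ and occlusion accuracy are commonly computed. It assumes the usual TAP-Vid convention: pixel thresholds of 1, 2, 4, 8 and 16, evaluated only on points that are visible in the ground truth. The abstract does not restate these details, so the thresholds, the strict inequality, and the function names are assumptions. Benchmark-specific steps such as image rescaling are omitted.

```python
import math

THRESHOLDS_PX = (1, 2, 4, 8, 16)  # assumed TAP-Vid thresholds, in pixels

def delta_avg(pred, gt, visible, thresholds=THRESHOLDS_PX):
    """Percentage of visible points within each threshold, averaged over thresholds.

    pred and gt are sequences of (x, y) positions; visible holds the
    ground-truth visibility flags. Occluded points are skipped.
    """
    dists = [math.dist(p, g) for p, g, v in zip(pred, gt, visible) if v]
    if not dists:
        raise ValueError("no visible ground-truth points to evaluate")
    per_threshold = [sum(d < t for d in dists) / len(dists) for t in thresholds]
    return 100.0 * sum(per_threshold) / len(per_threshold)

def occlusion_accuracy(pred_visible, gt_visible):
    """Percentage of points whose predicted visibility flag matches ground truth."""
    if not gt_visible:
        raise ValueError("no points to evaluate")
    correct = sum(p == g for p, g in zip(pred_visible, gt_visible))
    return 100.0 * correct / len(gt_visible)

# Toy example with made-up positions: errors of 0, 3 and 20 pixels on the
# three visible points; the fourth point is occluded and ignored.
pred = [(0, 0), (3, 0), (20, 0), (5, 5)]
gt = [(0, 0), (0, 0), (0, 0), (0, 0)]
visible = [True, True, True, False]
score = delta_avg(pred, gt, visible)
```

Averaging over several thresholds rewards both coarse and fine localisation, so a single large error lowers the score at every threshold while a sub-pixel error costs nothing.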

Paper Structure

This paper contains 18 sections, 7 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Comparison of approaches: CoTracker3 karaev2024cotracker3, MASt3R leroy2024grounding and PointSt3R, on four selected timesteps of a TAP-Vid-DAVIS doersch2022tap video. The blue dots represent the ground truth positions, whilst the red represent the predictions of the models. Where MASt3R clearly fails to track the dynamic scene, PointSt3R recovers accuracy, performing on par with CoTracker3. The yellow bars represent the visibility accuracy per frame. MASt3R has $0\%$ as it has no ability to predict visibility, whilst PointSt3R produces comparable results to CoTracker3.
  • Figure 2: PCA visualisation of feature maps extracted from CoTracker3 karaev2024cotracker3, MASt3R leroy2024grounding and our PointSt3R on two frames of a TAP-Vid-DAVIS doersch2022tap video. CoTracker3 produces locally-sensitive features without global artifacts. MASt3R produces globally-sensitive features that represent the 3D static background, with no unique features for dynamic objects. PointSt3R balances this by introducing local features and retaining 3D global features.
  • Figure 3: Tracking accuracy $\delta_{\text{avg}}\uparrow$ of PointSt3R when changing the percentage of dynamic correspondences $r$, where $0\%$ is only static correspondences and $100\%$ is dynamic correspondences only. The dashed lines represent MASt3R leroy2024grounding.
  • Figure 4: Comparison of tracking accuracy $\delta_{\text{avg}}\uparrow$ on TAP-Vid-DAVIS doersch2022tap against track length (in frames) for MASt3R leroy2024grounding and PointSt3R (Dynamic data) with two stride options for training. The green line shows the advantage of bigger strides for longer tracks.
  • Figure 5: Comparison of MASt3R leroy2024grounding (left) and PointSt3R (right) tracks on TAP-Vid-DAVIS doersch2022tap. The blue dots represent the ground truth positions, whilst the red represent the predictions of the models. When prediction is far from GT, a red dash/line is visible. Larger errors are evident for MASt3R.
  • ...and 1 more figure