PointSt3R: Point Tracking through 3D Grounded Correspondence
Rhodri Guerrier, Adam W. Harley, Dima Damen
TL;DR
PointSt3R recasts point tracking as two-frame, 3D-grounded correspondence by fine-tuning MASt3R with a dynamic matching loss and a visibility head, enabling robust tracking in dynamic scenes without temporal context at inference. The method trains on synthetic, dynamically annotated pairs to learn long-range correspondences and uses nearest-neighbor feature matching to perform both 2D and 3D tracking. Across TAP-Vid-DAVIS, RoboTAP, RGB-S, EgoPoints, and PStudio, PointSt3R achieves competitive 2D tracking performance and notable gains in 3D tracking relative to static baselines, often approaching state-of-the-art trackers. The results highlight the value of 3D-grounded representations for dynamic point tracking and show that a fine-tuned 3D reconstruction model can track dynamic points without relying on temporal context at inference.
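To make the nearest-neighbor matching step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each frame has already been encoded into a dense per-pixel descriptor map (here a nested list indexed as `feat[y][x]`), and it uses cosine similarity as the matching score; the helper names `cosine` and `track_point` are hypothetical. Visibility is predicted by a separate head in PointSt3R and is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def track_point(feat_query, feat_target, query_xy):
    """Find the target-frame pixel whose descriptor best matches the query.

    feat_query, feat_target: H x W x D nested lists of descriptors (assumed
    to come from a two-frame model; the encoder itself is out of scope).
    query_xy: (x, y) pixel location of the query point in the query frame.
    Returns ((x, y), similarity) for the best match in the target frame.
    """
    qx, qy = query_xy
    q = feat_query[qy][qx]
    best_xy, best_sim = None, -math.inf
    # Exhaustive search over every target pixel; no temporal context is used.
    for y, row in enumerate(feat_target):
        for x, d in enumerate(row):
            s = cosine(q, d)
            if s > best_sim:
                best_xy, best_sim = (x, y), s
    return best_xy, best_sim


# Toy 2x2 descriptor maps with 2-D descriptors (illustrative values only).
feat_query = [[[1, 0], [0, 1]], [[1, 1], [-1, 0]]]
feat_target = [[[0, 1], [-1, 0]], [[1, 0], [1, 1]]]
match_xy, match_sim = track_point(feat_query, feat_target, (0, 0))
print(match_xy, match_sim)
```

Because each target frame is matched against the query frame independently, tracking a point through a video under this scheme amounts to repeating the search once per frame. A practical implementation would batch the search with a tensor library instead of Python loops.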
Abstract
Recent advances in foundational 3D reconstruction models, such as DUSt3R and MASt3R, have shown great potential in 2D and 3D correspondence in static scenes. In this paper, we propose to adapt them for the task of point tracking through 3D grounded correspondence. We first demonstrate that these models are competitive point trackers when evaluated on the static points present in current point tracking benchmarks ($+33.5\%$ on EgoPoints vs. CoTracker2). We then combine the reconstruction loss with training for dynamic correspondence and a visibility head, and fine-tune MASt3R for point tracking using a relatively small amount of synthetic data. Importantly, we only train and evaluate on pairs of frames where one contains the query point, effectively removing any temporal context. Using a mix of dynamic and static point correspondences, we achieve competitive or superior point tracking results on four datasets (e.g., competitive on TAP-Vid-DAVIS, with 73.8 $\delta_{avg}$ / 85.8\% occlusion accuracy for PointSt3R compared to 75.7 / 88.3\% for CoTracker2, and significantly better than CoTracker3 on EgoPoints, 61.3 vs. 54.2, and on RGB-S, 87.0 vs. 82.8). We also present results on 3D point tracking, along with several ablations on the training datasets and the percentage of dynamic correspondences.
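For readers unfamiliar with the reported metrics, the sketch below illustrates how $\delta_{avg}$ and occlusion accuracy are commonly computed in the TAP-Vid evaluation protocol. It assumes the standard pixel thresholds of 1, 2, 4, 8, and 16 and that only points visible in the ground truth count towards $\delta_{avg}$; the function names are hypothetical, and the exact evaluation settings used in this paper (e.g. image resolution) are not stated in the abstract.

```python
import math

# Pixel thresholds assumed from the standard TAP-Vid protocol.
THRESHOLDS_PX = (1, 2, 4, 8, 16)


def delta_avg(pred, gt, visible, thresholds=THRESHOLDS_PX):
    """Average, over thresholds, of the percentage of visible points whose
    predicted position lies strictly within the threshold of the ground truth.

    pred, gt: sequences of (x, y) positions; visible: ground-truth flags.
    """
    errors = [math.dist(p, g) for p, g, v in zip(pred, gt, visible) if v]
    if not errors:
        raise ValueError("no visible ground-truth points to evaluate")
    per_threshold = [sum(e < t for e in errors) / len(errors) for t in thresholds]
    return 100.0 * sum(per_threshold) / len(thresholds)


def occlusion_accuracy(pred_visible, gt_visible):
    """Percentage of points whose predicted visibility flag matches the label."""
    correct = sum(p == g for p, g in zip(pred_visible, gt_visible))
    return 100.0 * correct / len(gt_visible)


# Toy example: three visible points with errors of 0, 3, and 10 pixels,
# plus one occluded point that is ignored by delta_avg.
pred = [(0, 0), (3, 0), (10, 0), (100, 100)]
gt = [(0, 0), (0, 0), (0, 0), (0, 0)]
gt_vis = [True, True, True, False]
pred_vis = [True, True, False, False]
print(delta_avg(pred, gt, gt_vis))
print(occlusion_accuracy(pred_vis, gt_vis))
```

In the toy example, the three visible points fall within 1, 1, 2, 2, and 3 of the five thresholds respectively, so $\delta_{avg}$ is about 60, and three of the four visibility flags are correct, giving an occlusion accuracy of 75.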
