M2P: Improving Visual Foundation Models with Mask-to-Point Weakly-Supervised Learning for Dense Point Tracking

Qiangqiang Wu, Tianyu Yang, Bo Fang, Jia Wan, Matias Di Martino, Guillermo Sapiro, Antoni B. Chan

Abstract

Tracking Any Point (TAP) has emerged as a fundamental tool for video understanding. Current approaches adapt Vision Foundation Models (VFMs) such as DINOv2 via offline fine-tuning or test-time optimization. However, these VFMs rely on static image pre-training, which is inherently sub-optimal for capturing dense temporal correspondence in videos. To address this, we propose Mask-to-Point (M2P) learning, which leverages rich video object segmentation (VOS) mask annotations to improve VFMs for dense point tracking. M2P introduces three new mask-based constraints for weakly-supervised representation learning. First, we propose a local structure consistency loss, which leverages Procrustes analysis to model the cohesive motion of points lying within a local structure, enabling more reliable learning of point-to-point matching. Second, we propose a mask label consistency (MLC) loss, which enforces that sampled foreground points strictly match foreground regions across frames. The MLC loss can be regarded as a regularizer that stabilizes training and prevents convergence to trivial solutions. Finally, a mask boundary constraint is applied to explicitly supervise boundary points. We show that our weakly-supervised M2P models significantly outperform baseline VFMs while training efficiently on only 3.6K VOS training videos. Notably, M2P achieves 12.8% and 14.6% performance gains over DINOv2-B/14 and DINOv3-B/16 on the TAP-Vid-DAVIS benchmark, respectively. Moreover, the proposed M2P models are used as pre-trained backbones for both test-time optimized and offline fine-tuned TAP tasks, demonstrating their potential to serve as general pre-trained models for point tracking. Code will be made publicly available upon acceptance.

Paper Structure (16 sections, 20 equations, 11 figures, 9 tables, 1 algorithm)


Figures (11)

  • Figure 1: Our proposed Mask-to-Point (M2P) learning improves visual foundation models (e.g., DINOs) on dense point tracking, while being trained on only 3.6K VOS videos.
  • Figure 2: Overall pipeline of the proposed Mask-to-Point (M2P) weakly-supervised learning, which enhances existing visual foundation models (VFMs) for dense point tracking by leveraging video object segmentation (VOS) datasets with mask annotations. M2P introduces three new mask-based constraints for representation learning: local structure consistency, mask label consistency, and a mask boundary constraint.
  • Figure 3: Illustration of our proposed local structure consistency loss $\mathcal{L}_{LSC}$. The top-$K_e$ strong matches are found between the query points $\mathbf{p}^g$ and their predicted correspondences, $\{\tilde{\mathbf{p}}^g_i\}_{i=1}^{K_e}$. These strongly matched points are used to estimate a similarity transformation, which is then applied to the other points to obtain the remaining pseudo-labels $\tilde{\mathbf{p}}^g_i$, $i>K_e$, for supervision.
  • Figure 4: Illustration of the confidence score $\mathcal{S}_i^g$ in our mask label consistency constraint. Zoom in for clearer visualization.
  • Figure 5: Overall pipeline of the proposed M2P-Tracker for test-time-optimized TAP. M2P-Tracker uses our M2Pv3-S/16 backbone for feature extraction, providing a strong temporal prior for more effective online adaptation.
  • ...and 6 more figures
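
The pseudo-labeling step described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes 2D pixel coordinates, uses the standard closed-form Procrustes (Umeyama) solution for a similarity transform, and assumes that match strength is given by a per-point confidence array. The names `estimate_similarity`, `pseudo_labels`, and `scores` are hypothetical and do not come from the paper.

```python
import numpy as np


def estimate_similarity(src, dst):
    """Closed-form least-squares similarity transform (Procrustes / Umeyama).

    Returns scale s, rotation R, translation t such that
    dst is approximately s * src @ R.T + t.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.ones(src.shape[1])
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0

    R = U @ np.diag(d) @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = (S * d).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def pseudo_labels(query_pts, pred_pts, scores, k_e):
    """Keep the k_e strongest predicted matches and replace the remaining
    predictions with the query points mapped through the similarity
    transform fitted on the strong matches.

    ASSUMPTION: `scores` is a hypothetical per-point match confidence;
    the source does not specify how strong matches are ranked.
    """
    order = np.argsort(-np.asarray(scores))
    strong, rest = order[:k_e], order[k_e:]
    s, R, t = estimate_similarity(query_pts[strong], pred_pts[strong])

    labels = np.array(pred_pts, dtype=np.float64)
    labels[rest] = s * query_pts[rest] @ R.T + t
    return labels, strong
```

In this sketch the weaker predictions are overwritten by the transformed query points, so a few unreliable matches inside a local structure inherit the motion implied by the reliable ones. How the resulting pseudo-labels enter $\mathcal{L}_{LSC}$ is not specified in this excerpt and is left out here.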