Can Visual Foundation Models Achieve Long-term Point Tracking?

Görkay Aydemir, Weidi Xie, Fatma Güney

TL;DR

The paper investigates whether visual foundation models can sustain reliable long-term point tracking by evaluating the geometric awareness of their representations in zero-shot, probing, and LoRA-based adaptation settings. Using correlation maps, it compares a broad suite of models and finds that Stable Diffusion and DINOv2 show the strongest zero-shot geometric correspondence, while DINOv2 performs comparably to supervised methods when lightly adapted. The results suggest that foundation models provide a strong initialization for correspondence learning and that low-parameter adaptation can reach competitive tracking performance. The paper also points to future directions involving multi-frame integration to better handle occlusions and feature drift in long sequences.

Abstract

Large-scale vision foundation models have demonstrated remarkable success across various tasks, underscoring their robust generalization capabilities. While their proficiency in two-view correspondence has been explored, their effectiveness in long-term correspondence within complex environments remains unexplored. To address this, we evaluate the geometric awareness of visual foundation models in the context of point tracking: (i) in zero-shot settings, without any training; (ii) by probing with low-capacity layers; (iii) by fine-tuning with Low Rank Adaptation (LoRA). Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings. Furthermore, DINOv2 achieves performance comparable to supervised models in adaptation settings, demonstrating its potential as a strong initialization for correspondence learning.

Paper Structure

This paper contains 10 sections, 3 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Correlation Map. The correlation map represents the similarity between frame features $\mathbf{F}_{t}$ and a query feature $\mathbf{q}$.
  • Figure 2: Qualitative Results of Zero-Shot Point Tracking on TAP-Vid DAVIS. Query points from three videos are used to generate correlation maps that compare the sampled query feature with features from different frames. Real correspondences are shown in the upper row. The models evaluated are Stable Diffusion [Rombach2022CVPR], DINOv2 [Oquab2023ARXIV], and SAM [Kirillov2023ICCV]. Predictions, taken as the most similar locations, are marked with red stars, and ground-truth correspondences with blue stars. Red lines connect each pair to show the spatial error between prediction and ground truth. Warmer colors in the correlation maps indicate higher similarity.
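
The two captions describe a correlation map between a query feature $\mathbf{q}$ and frame features $\mathbf{F}_{t}$, with the prediction placed at the most similar location. The following is a minimal sketch of that idea, not the authors' implementation. It assumes cosine similarity as the similarity measure (the captions say only "similarity"), uses random arrays in place of real backbone features, and the function names `correlation_map` and `predict_location` are hypothetical.

```python
import numpy as np

def correlation_map(frame_feats, query_feat, eps=1e-8):
    """Similarity between one query feature and every location of a frame.

    frame_feats: array of shape (H, W, C), the feature map F_t of one frame.
    query_feat:  array of shape (C,), the sampled query feature q.
    Cosine similarity is an assumption made for this sketch.
    """
    f = frame_feats / (np.linalg.norm(frame_feats, axis=-1, keepdims=True) + eps)
    q = query_feat / (np.linalg.norm(query_feat) + eps)
    return f @ q  # shape (H, W)

def predict_location(corr):
    """Return the (row, col) of the most similar location in the map."""
    row, col = np.unravel_index(np.argmax(corr), corr.shape)
    return int(row), int(col)

# Toy data standing in for features from a frozen backbone (assumption).
rng = np.random.default_rng(0)
H, W, C = 8, 12, 16
frame_feats = rng.normal(size=(H, W, C))

# Sample the query at a known location so the expected match is unambiguous.
query_feat = frame_feats[3, 5].copy()

corr = correlation_map(frame_feats, query_feat)
pred = predict_location(corr)
print(corr.shape, pred)
```

In a zero-shot setting this argmax would be taken independently for each frame, at the resolution of the feature grid. Mapping the result back to pixel coordinates depends on the backbone's stride, which is not specified here.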