DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

Narek Tumanyan, Assaf Singer, Shai Bagon, Tali Dekel

TL;DR

DINO-Tracker introduces a self-supervised, test-time training paradigm for long-term dense point tracking in video by refining pre-trained DINOv2 features with a residual network (Delta-DINO) and coupling this refinement with a learnable tracker. The framework leverages both short-term optical-flow cues ($L_{\texttt{flow}}$) and semantic feature correspondences via DINO best-buddies as supervision, along with refined best-buddies, cycle-consistency, and prior-preservation losses to produce trajectory-consistent embeddings. It achieves state-of-the-art performance among self-supervised trackers and remains competitive with supervised trackers, especially in scenarios with long occlusions, while maintaining efficiency (per-video training on a single GPU). This work demonstrates the value of external priors from self-supervised vision models for dense video tracking and opens avenues for leveraging semantic priors in test-time adaptation, with limitations around occluders and potential identity switches under highly ambiguous scenes.
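
The best-buddies supervision mentioned above can be illustrated with a small sketch. It assumes that "best-buddies" means mutual nearest neighbours between the features of two frames and that similarity is measured by cosine similarity; the function name `best_buddies` and the array layout are hypothetical, and the filtering and loss actually used in the paper are not reproduced here.

```python
import numpy as np

def best_buddies(feats_a, feats_b):
    """Return index pairs (i, j) that are mutual nearest neighbours.

    feats_a: (Na, C) features from one frame (hypothetical layout).
    feats_b: (Nb, C) features from another frame.
    Cosine similarity is an assumption made for this sketch.
    """
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T                    # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(axis=1)       # best match in B for every point of A
    nn_ba = sim.argmax(axis=0)       # best match in A for every point of B
    idx_a = np.arange(len(a))
    mutual = nn_ba[nn_ab] == idx_a   # keep i only if its match points back to i
    return np.stack([idx_a[mutual], nn_ab[mutual]], axis=1)

# Toy usage: points 0 and 1 have a mutual match, point 2 does not.
frame_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
frame_b = np.array([[0.0, 2.0], [3.0, 0.0]])
pairs = best_buddies(frame_a, frame_b)
```

Requiring the match to hold in both directions discards one-sided matches, which is why such pairs are a plausible source of sparse but reliable correspondences between distant frames.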

Abstract

We present DINO-Tracker -- a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adapts DINO's features to fit the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-Tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.

Paper Structure

This paper contains 47 sections, 11 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: DINO-Tracker provides long-range dense trajectories through repeated occlusions and challenging object deformations (a). For visualization purposes, the trajectories are shown for sampled points, yet our method tracks any point. Our test-time training framework leverages a pre-trained DINO-ViT model and optimizes its internal features for tracking in a single video. (b) Visualization of trajectory features using t-SNE: we reduce the dimensionality of foreground features extracted from all frames to 3D using t-SNE, for both raw DINO features and our optimized ones. Features sampled along ground-truth trajectories are marked in color, where each color indicates a different trajectory. Our refined features exhibit tight "trajectory-clusters", allowing our method to associate matching points across distant frames and occlusions.
  • Figure 2: DINO-Tracker at inference: Features are extracted from a reference frame $\mathbf{I}^k$ and a target frame $\mathbf{I}^t$. Our feature extractor consists of a fixed pre-trained DINOv2 model and our CNN Delta-DINO model, which predicts a residual to DINO's features. To track a query point $\mathbf{x}_{q} \in \mathbf{I}^{k}$, we compute the cost volume between its sampled feature $\pmb{\varphi}_{q}$ and the target feature map $\mathbf{\Phi}(\mathbf{I}^t)$. The resulting heatmap $\mathbf{S}$ is refined, and the final tracked position $\hat{\mathbf{x}}^t$ is estimated based on points in the vicinity of the maximal location.
  • Figure 3: Visibility via trajectory agreement. To determine the visibility of $\mathbf{x}_{\mathbf{q}}$ at time $t\!=\!o$, we track $\hat{\mathbf{x}}^o$ across time and check the agreement between $\Pi(\hat{\mathbf{x}}^o, t)$ and $\Pi(\mathbf{x}, t)$. This is done by measuring $d_{k_1}, d_{k_2}$ -- displacements between the (black and red) tracks for anchor time steps $k_1, k_2$. Since these displacements are large, we classify $\mathbf{x}_{\mathbf{q}}$ as occluded for $t\!=\!o$. For $t\!=\!v$, the track $\Pi(\hat{\mathbf{x}}^v, t)$ (green) agrees with $\Pi(\mathbf{x}, t)$, thus $\mathbf{x}_{\mathbf{q}}$ is classified as visible for $t\!=\!v$.
  • Figure 4: Qualitative results on TAP-Vid-DAVIS (480). Query points are color-coded on a reference frame (top). Our method exhibits better association of tracks across occlusions compared to SOTA trackers. Full videos and additional results are in the supplementary materials (SM) on our website.
  • Figure 5: Sample results on BADJA w.r.t. ground truth. Query points are color-coded on the frame at the top. Tracked points are marked on the target frames. Red lines indicate tracking errors w.r.t. the ground truth positions.
  • ...and 3 more figures
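
The inference and visibility steps described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the learned heatmap refinement is omitted, the cost volume is taken to be cosine similarity, and the window `radius`, `temperature`, `max_dist` threshold, and the rule that all anchors must agree are hypothetical choices. The function names `track_point` and `is_visible` are likewise made up for this example.

```python
import numpy as np

def track_point(query_feat, target_feats, radius=2, temperature=0.05):
    """Locate a query feature in a target feature map.

    query_feat: (C,) feature sampled at the query point.
    target_feats: (H, W, C) feature map of the target frame.
    Returns (x, y) in feature-map coordinates.
    """
    q = query_feat / np.linalg.norm(query_feat)
    f = target_feats / np.linalg.norm(target_feats, axis=-1, keepdims=True)
    heat = f @ q                                   # (H, W) similarity heatmap
    H, W = heat.shape
    y0, x0 = np.unravel_index(heat.argmax(), heat.shape)
    # Restrict the estimate to a window around the maximal location.
    ys = np.arange(max(0, y0 - radius), min(H, y0 + radius + 1))
    xs = np.arange(max(0, x0 - radius), min(W, x0 + radius + 1))
    window = heat[ys[0]:ys[-1] + 1, xs[0]:xs[-1] + 1]
    w = np.exp((window - window.max()) / temperature)
    w /= w.sum()                                   # softmax weights in the window
    x = float((w.sum(axis=0) * xs).sum())          # weighted mean column
    y = float((w.sum(axis=1) * ys).sum())          # weighted mean row
    return x, y

def is_visible(traj_query, traj_candidate, anchors, max_dist=2.0):
    """Compare two (T, 2) trajectories at anchor time steps.

    The point counts as visible only if every anchor displacement is
    at most max_dist (an assumed aggregation rule).
    """
    d = np.linalg.norm(traj_query[anchors] - traj_candidate[anchors], axis=1)
    return bool((d <= max_dist).all())

# Toy usage: a single matching feature at row 3, column 1.
feats = np.tile(np.array([0.0, 1.0, 0.0]), (5, 5, 1))
feats[3, 1] = [1.0, 0.0, 0.0]
x, y = track_point(np.array([1.0, 0.0, 0.0]), feats)

traj_q = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
traj_c = np.array([[0.0, 0.0], [1.0, 1.5], [9.0, 9.0]])
agree = is_visible(traj_q, traj_c, [0, 1])
disagree = is_visible(traj_q, traj_c, [1, 2])
```

Averaging positions only near the peak gives sub-cell precision while keeping distant spurious responses from pulling the estimate away from the maximum.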