DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

Narek Tumanyan; Assaf Singer; Shai Bagon; Tali Dekel

TL;DR

DINO-Tracker introduces a self-supervised, test-time training paradigm for long-term dense point tracking in video by refining pre-trained DINOv2 features with a residual network (Delta-DINO) and coupling this refinement with a learnable tracker. The framework leverages both short-term optical-flow cues ($L_{\texttt{flow}}$) and semantic feature correspondences via DINO best-buddies as supervision, along with refined best-buddies, cycle-consistency, and prior-preservation losses to produce trajectory-consistent embeddings. It achieves state-of-the-art performance among self-supervised trackers and remains competitive with supervised trackers, especially in scenarios with long occlusions, while maintaining efficiency (per-video training on a single GPU). This work demonstrates the value of external priors from self-supervised vision models for dense video tracking and opens avenues for leveraging semantic priors in test-time adaptation, with limitations around occluders and potential identity switches under highly ambiguous scenes.
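To make the "best-buddies" supervision mentioned above more concrete, here is a minimal sketch. It assumes best-buddies means pairs of features that are mutual nearest neighbours under cosine similarity. The function names (`cosine`, `best_buddies`) and the toy 2-D vectors are illustrative only and are not taken from the paper's code, which would operate on DINO feature maps.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def best_buddies(feats_a, feats_b):
    """Return index pairs (i, j) that are mutual nearest neighbours.

    feats_a, feats_b: lists of feature vectors from two frames (hypothetical
    stand-ins for per-location descriptors).
    """
    sim = [[cosine(u, v) for v in feats_b] for u in feats_a]
    # Nearest neighbour in B for every feature in A, and vice versa.
    nn_ab = [max(range(len(feats_b)), key=lambda j: sim[i][j])
             for i in range(len(feats_a))]
    nn_ba = [max(range(len(feats_a)), key=lambda i: sim[i][j])
             for j in range(len(feats_b))]
    # Keep a pair only if each side picks the other.
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]


# Toy example: three features in frame A, two in frame B.
feats_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
feats_b = [[0.1, 1.0], [1.0, 0.1]]
pairs = best_buddies(feats_a, feats_b)
print(pairs)  # [(0, 1), (1, 0)]
```

The third feature in frame A is closest to `feats_b[0]`, but that feature prefers `feats_a[1]`, so the pair is rejected. This mutual check is what makes such correspondences a comparatively reliable supervision signal.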

Abstract

We present DINO-Tracker -- a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adapts DINO's features to fit the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses, and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-Tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.

Paper Structure (47 sections, 11 equations, 8 figures, 5 tables)

Introduction
Related Work
Optical flow
Learning correspondences from videos
Feedforward models for dense tracking
Optimization-based tracking
DINO-ViT features as local semantic descriptors
Method
DINO-Tracker
Self-Supervision
Optical flow
Feature correspondences
Objective
Flow loss
DINO Best-Buddies Loss
...and 32 more sections

Figures (8)

Figure 1: DINO-Tracker provides long-range dense trajectories, past repeating occlusions and during challenging object deformations (a); For visualization purposes, the trajectories are shown for sampled points, yet our method tracks any point. Our test-time training framework leverages a pre-trained DINO-ViT model, and optimizes its internal features for tracking in a single video. (b) Visualization of trajectory features using t-SNE: We reduce the dimensionality of foreground features extracted from all frames to 3D using t-SNE, for both raw DINO features and our optimized ones; Features sampled along ground-truth trajectories are marked in color, where each color indicates a different trajectory. Our refined features exhibit tight "trajectory-clusters", allowing our method to associate matching points across distant frames and occlusion.
Figure 2: DINO-Tracker at inference: Features are extracted from a reference frame $\mathbf{I}^k$, and a target frame $\mathbf{I}^t$. Our feature extractor consists of a fixed pre-trained DINOv2 model, and our CNN Delta-DINO model, which predicts a residual to DINO's features. To track a query point $\mathbf{x}_{q} \in \mathbf{I}^{k}$, we compute the cost volume between its sampled feature $\pmb{\varphi}_{q}$, and the target feature map $\mathbf{\Phi}(\mathbf{I}^t)$. The resulting heatmap $\mathbf{S}$ is refined, and the final tracked position $\hat{\mathbf{x}}^t$ is estimated based on points in the vicinity of the maximal location.
Figure 3: Visibility via trajectory agreement. To determine the visibility of $\mathbf{x}_{\mathbf{q}}$ at time $t\!=\!o$, we track $\hat{\mathbf{x}}^o$ across time and check the agreement between $\Pi(\hat{\mathbf{x}}^o, t)$ and $\Pi(\mathbf{x}_{\mathbf{q}}, t)$. This is done by measuring $d_{k_1},d_{k_2}$ -- displacements between the (black and red) tracks for anchor time steps $k_1, k_2$. Since these displacements are large, we classify $\mathbf{x}_{\mathbf{q}}$ as occluded for $t\!=\!o$. For $t\!=\!v$, the track $\Pi(\hat{\mathbf{x}}^v, t)$ (green) agrees with $\Pi(\mathbf{x}_{\mathbf{q}}, t)$, thus $\mathbf{x}_{\mathbf{q}}$ is classified as visible for $t\!=\!v$.
Figure 4: Qualitative results on TAP-Vid-DAVIS (480). Query points are color-coded on a reference frame (top). Our method exhibits better association of tracks across occlusions compared to SOTA trackers. Full videos and additional results are in the supplementary materials (SM) on our website.
Figure 5: Sample results on BADJA w.r.t. ground truth. Query points are color-coded on the frame at the top. Tracked points are marked on the target frames. Red lines indicate tracking errors w.r.t. the ground truth positions.
...and 3 more figures
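
The inference procedure described in the Figure 2 caption (cost volume, heatmap, position estimated around the maximum) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the learned heatmap refinement is omitted, cosine similarity stands in for the cost volume, and the softmax-weighted average over a small window is only one plausible way to estimate a position "in the vicinity of the maximal location". The names `track_point`, `radius`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def track_point(query_feat, target_feats, radius=1, temperature=0.1):
    """Estimate the (x, y) position of a query feature in a target frame.

    query_feat: feature vector sampled at the query point.
    target_feats: H x W grid (list of rows) of feature vectors.
    """
    # Cost volume for a single query: similarity to every target location.
    heat = [[cosine(query_feat, f) for f in row] for row in target_feats]
    h, w = len(heat), len(heat[0])

    # Location of the heatmap maximum.
    y0, x0 = max(((y, x) for y in range(h) for x in range(w)),
                 key=lambda p: heat[p[0]][p[1]])

    # Softmax-weighted average of coordinates in a window around the maximum.
    total = sx = sy = 0.0
    for y in range(max(0, y0 - radius), min(h, y0 + radius + 1)):
        for x in range(max(0, x0 - radius), min(w, x0 + radius + 1)):
            wgt = math.exp((heat[y][x] - heat[y0][x0]) / temperature)
            total += wgt
            sx += wgt * x
            sy += wgt * y
    return sx / total, sy / total


# Toy 3x3 feature grid: the centre matches the query exactly, and the cell to
# its right matches partially, so the estimate shifts slightly to the right.
query = [1.0, 0.0]
grid = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
grid[1][1] = [1.0, 0.0]
grid[1][2] = [1.0, 1.0]
x_hat, y_hat = track_point(query, grid)
```

Restricting the weighted average to a window around the maximum gives sub-cell precision while preventing distant, spuriously similar locations from pulling the estimate away.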
