DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video
Narek Tumanyan, Assaf Singer, Shai Bagon, Tali Dekel
TL;DR
DINO-Tracker introduces a self-supervised, test-time training paradigm for long-term dense point tracking in video by refining pre-trained DINOv2 features with a residual network (Delta-DINO) and coupling this refinement with a learnable tracker. The framework leverages both short-term optical-flow cues ($L_{\texttt{flow}}$) and semantic feature correspondences via DINO best-buddies as supervision, along with refined best-buddies, cycle-consistency, and prior-preservation losses to produce trajectory-consistent embeddings. It achieves state-of-the-art performance among self-supervised trackers and remains competitive with supervised trackers, especially in scenarios with long occlusions, while maintaining efficiency (per-video training on a single GPU). This work demonstrates the value of external priors from self-supervised vision models for dense video tracking and opens avenues for leveraging semantic priors in test-time adaptation, with limitations around occluders and potential identity switches under highly ambiguous scenes.
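The "best-buddies" supervision mentioned above can be read as mutual nearest neighbors between the feature sets of two frames: a pair is kept only when each feature is the other's closest match. The following is a minimal sketch under that reading, not the authors' implementation. The function names, the choice of cosine similarity, and the toy 2-D vectors are assumptions made for illustration; the summary does not specify how the paper measures similarity or filters pairs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(query, candidates):
    """Index of the candidate most similar to `query` (first index wins ties)."""
    return max(
        range(len(candidates)),
        key=lambda j: cosine_similarity(query, candidates[j]),
    )


def best_buddies(feats_a, feats_b):
    """Return (i, j) pairs where feats_a[i] and feats_b[j] are mutual nearest neighbors.

    Hypothetical helper: feats_a and feats_b stand in for per-patch features
    of two video frames (assumed layout: a list of vectors per frame).
    """
    a_to_b = [nearest_neighbor(f, feats_b) for f in feats_a]
    b_to_a = [nearest_neighbor(f, feats_a) for f in feats_b]
    # Keep a pair only if the match holds in both directions.
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


# Toy features (illustrative values, not real DINO outputs).
feats_a = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
feats_b = [[0.9, 0.1], [0.1, 0.9]]
pairs = best_buddies(feats_a, feats_b)
```

In this toy case the third feature in `feats_a` picks the first feature in `feats_b` as its nearest neighbor, but that match is not reciprocated, so it is discarded. The mutual check is what makes such pairs a comparatively reliable, if sparse, source of correspondence supervision.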
Abstract
We present DINO-Tracker -- a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adapts DINO's features to fit the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-Tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.
