Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers
Zhengbo Zhang, Li Xu, Duo Peng, Hossein Rahmani, Jun Liu
TL;DR
This work tackles unsupervised visual tracking by exploiting the semantic and structural knowledge embedded in pre-trained text-to-image diffusion models. It introduces Diff-Tracker, a framework with two components: an Initial Prompt Learner that learns a target-specific prompt, and an Online Prompt Updater that adapts the prompt using target motion information. The framework is guided by attention harmonization, which fuses cross-attention with self-attention. The optimization objective combines a diffusion-model constraint $L_{DM}$ with a cross-attention supervision term, giving the overall loss $L = \|\mathcal{M} - \mathfrak{F}_1\|_2^2 + L_{DM}$, where the integrated attention map is $\mathcal{M} = (1 - \alpha) M_c' + \alpha M_c$. Prompt updates follow $p_k = (1 - \beta)\, \mathcal{H}_b(p_{k-1} + l_k) + \beta\, p_{k-1}$. Experiments on five benchmarks show state-of-the-art unsupervised performance, indicating that diffusion-model knowledge can be repurposed for label-free visual tracking, with practical implications for data-efficient tracking in dynamic environments.
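As a rough illustration of the arithmetic in the formulas above, the following Python sketch evaluates the attention fusion, the attention part of the loss, and the prompt update on flattened vectors. It is not the authors' implementation. The function names are hypothetical, the $L_{DM}$ term is omitted, and because this summary does not define $\mathcal{H}_b$, $l_k$, or $\mathfrak{F}_1$, the code treats $\mathcal{H}_b$ as a caller-supplied function and the other two as plain lists of numbers.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def fuse_attention(m_c_prime: Sequence[float], m_c: Sequence[float], alpha: float) -> Vector:
    """Integrated map M = (1 - alpha) * M_c' + alpha * M_c on flattened maps."""
    if len(m_c_prime) != len(m_c):
        raise ValueError("attention maps must have the same number of elements")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(m_c_prime, m_c)]


def attention_loss(m: Sequence[float], f_1: Sequence[float]) -> float:
    """Squared L2 distance ||M - F_1||_2^2; the L_DM term is NOT included here."""
    if len(m) != len(f_1):
        raise ValueError("map and target must have the same number of elements")
    return sum((a - b) ** 2 for a, b in zip(m, f_1))


def update_prompt(
    p_prev: Sequence[float],
    l_k: Sequence[float],
    beta: float,
    h_b: Callable[[Vector], Vector],
) -> Vector:
    """p_k = (1 - beta) * H_b(p_{k-1} + l_k) + beta * p_{k-1}.

    h_b stands in for H_b, which the summary does not define (assumption:
    it maps a prompt vector to a vector of the same length).
    """
    if len(p_prev) != len(l_k):
        raise ValueError("prompt and l_k must have the same number of elements")
    candidate = h_b([p + l for p, l in zip(p_prev, l_k)])
    return [(1.0 - beta) * c + beta * p for c, p in zip(candidate, p_prev)]


if __name__ == "__main__":
    fused = fuse_attention([1.0, 0.0], [0.0, 1.0], alpha=0.25)
    print(fused, attention_loss(fused, [1.0, 0.0]))
    # Identity placeholder for H_b, purely for demonstration.
    print(update_prompt([1.0, 2.0], [0.5, -0.5], beta=0.5, h_b=lambda v: v))
```

With $\beta$ close to 1 the update keeps most of the previous prompt, while $\beta$ close to 0 lets the transformed candidate dominate; $\alpha$ plays the same balancing role between the two attention maps.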
Abstract
We introduce Diff-Tracker, a novel approach to the challenging unsupervised visual tracking task that leverages a pre-trained text-to-image diffusion model. Our main idea is to use the rich knowledge encapsulated within the pre-trained diffusion model, such as its understanding of image semantics and structural information, to address unsupervised visual tracking. To this end, we design an initial prompt learner that enables the diffusion model to recognize the tracking target by learning a prompt representing the target. Furthermore, to facilitate dynamic adaptation of the prompt to the target's movements, we propose an online prompt updater. Extensive experiments on five benchmark datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance.
