Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers

Zhengbo Zhang, Li Xu, Duo Peng, Hossein Rahmani, Jun Liu

TL;DR

This work tackles unsupervised visual tracking by exploiting the semantic and structural knowledge embedded in pre-trained text-to-image diffusion models. It introduces Diff-Tracker, a two-component framework: an Initial Prompt Learner that learns a target-specific prompt, and an Online Prompt Updater that adapts the prompt using target motion information. Learning is guided by attention harmonization, which fuses cross-attention with self-attention into an integrated attention map $\mathcal{M} = (1 - \alpha) M_c' + \alpha M_c$. The optimization objective combines a diffusion-model constraint $L_{DM}$ with a cross-attention supervision term $\|\mathcal{M} - \mathfrak{F}_1\|_2^2$, giving the overall loss $L = \|\mathcal{M} - \mathfrak{F}_1\|_2^2 + L_{DM}$. Prompt updates follow $p_k = (1 - \beta) \mathcal{H}_b(p_{k-1} + l_k) + \beta p_{k-1}$. Experiments on five benchmarks show state-of-the-art unsupervised performance, indicating that diffusion-model knowledge can be repurposed for robust, label-free visual tracking, with practical implications for data-efficient tracking in dynamic environments.
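
The three formulas in this summary are simple weighted combinations, and a small numerical sketch may make them easier to read. The NumPy code below is only an illustration, not the authors' implementation. The function names are hypothetical, the attention maps and prompt are stand-in arrays, $L_{DM}$ is passed in as a precomputed scalar, and $\mathcal{H}_b$ (whose form the summary does not specify) is replaced by an identity function purely as a placeholder.

```python
import numpy as np


def harmonize_attention(m_c_prime, m_c, alpha):
    """Integrated attention map: M = (1 - alpha) * M_c' + alpha * M_c."""
    return (1.0 - alpha) * m_c_prime + alpha * m_c


def overall_loss(m, target_map, l_dm):
    """Overall loss: squared L2 distance between M and the target map, plus L_DM.

    `l_dm` is assumed to be an already-computed scalar; how the
    diffusion-model constraint is evaluated is not covered here.
    """
    return float(np.sum((m - target_map) ** 2) + l_dm)


def update_prompt(p_prev, l_k, beta, h_b=lambda x: x):
    """Prompt update: p_k = (1 - beta) * H_b(p_{k-1} + l_k) + beta * p_{k-1}.

    `h_b` defaults to the identity, which is an assumption made only so
    the sketch runs; the summary does not define H_b.
    """
    return (1.0 - beta) * h_b(p_prev + l_k) + beta * p_prev


# Toy usage with made-up values.
m = harmonize_attention(np.ones((2, 2)), np.zeros((2, 2)), alpha=0.25)
loss = overall_loss(m, target_map=np.ones((2, 2)), l_dm=0.5)
p_k = update_prompt(np.array([1.0, 2.0]), np.array([1.0, 0.0]), beta=0.5)
```

In this reading, $\alpha$ sets how much of $M_c$ enters the integrated map, and $\beta$ sets how much of the previous prompt is retained at each update, so $\beta = 1$ would leave the prompt unchanged.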

Abstract

We introduce Diff-Tracker, a novel approach for the challenging unsupervised visual tracking task leveraging the pre-trained text-to-image diffusion model. Our main idea is to leverage the rich knowledge encapsulated within the pre-trained diffusion model, such as the understanding of image semantics and structural information, to address unsupervised visual tracking. To this end, we design an initial prompt learner to enable the diffusion model to recognize the tracking target by learning a prompt representing the target. Furthermore, to facilitate dynamic adaptation of the prompt to the target's movements, we propose an online prompt updater. Extensive experiments on five benchmark datasets demonstrate the effectiveness of our proposed method, which also achieves state-of-the-art performance.

Paper Structure

This paper contains 36 sections, 8 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: The framework of Diff-Tracker consists of the initial prompt learner (left) and the online prompt updater (right). The prompt produced by the online prompt updater is fed into the network of the initial prompt learner to obtain the output cross-attention map. This map is compared with the ground-truth (GT) cross-attention map to compute the loss used to update the online prompt updater.
  • Figure 2: The detailed architecture of the online prompt updater.