Diff-Tracker: Text-to-Image Diffusion Models are Unsupervised Trackers

Zhengbo Zhang; Li Xu; Duo Peng; Hossein Rahmani; Jun Liu

TL;DR

This work tackles unsupervised visual tracking by exploiting the semantic and structural knowledge embedded in pre-trained text-to-image diffusion models. It introduces Diff-Tracker, a framework with two components: an Initial Prompt Learner, which learns a target-specific prompt, and an Online Prompt Updater, which adapts that prompt using target motion information. Both are guided by attention harmonization, which fuses cross-attention with self-attention into the integrated attention map $\mathcal{M} = (1 - \alpha) M_c' + \alpha M_c$. The optimization objective combines a diffusion-model constraint $L_{DM}$ with a cross-attention supervision term $\|\mathcal{M} - \mathfrak{F}_1\|_2^2$, giving the overall loss $L = \|\mathcal{M} - \mathfrak{F}_1\|_2^2 + L_{DM}$. Prompt updates follow $p_k = (1 - \beta) \mathcal{H}_b(p_{k-1} + l_k) + \beta p_{k-1}$. Experiments on five benchmarks show state-of-the-art unsupervised performance, indicating that diffusion-model knowledge can be repurposed for robust, label-free visual tracking, with practical implications for data-efficient tracking in dynamic environments.
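The three formulas in the TL;DR can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, attention maps are plain NumPy arrays, the operator $\mathcal{H}_b$ is not specified in this summary and is replaced by an identity stand-in, the `target` argument stands in for the second term of the squared difference in the loss, and the diffusion-model term $L_{DM}$ is omitted.

```python
import numpy as np


def harmonize(m_c_prime, m_c, alpha):
    """Integrated attention map: M = (1 - alpha) * M_c' + alpha * M_c."""
    return (1.0 - alpha) * m_c_prime + alpha * m_c


def attention_loss(m, target):
    """Squared L2 distance between the integrated map and a target map.

    `target` is a hypothetical stand-in for the supervision map in the loss;
    the L_DM term of the overall objective is not modeled here.
    """
    return float(np.sum((m - target) ** 2))


def update_prompt(p_prev, l_k, beta, h_b=lambda x: x):
    """Prompt update: p_k = (1 - beta) * H_b(p_{k-1} + l_k) + beta * p_{k-1}.

    `h_b` defaults to the identity, which is an assumption made only so the
    sketch runs; the summary does not define H_b.
    """
    return (1.0 - beta) * h_b(p_prev + l_k) + beta * p_prev


# Toy usage with 2x2 maps and a 2-dimensional prompt vector.
m = harmonize(np.ones((2, 2)), np.zeros((2, 2)), alpha=0.25)
loss = attention_loss(m, np.ones((2, 2)))
p_k = update_prompt(np.array([1.0, 2.0]), np.array([1.0, 1.0]), beta=0.5)
```

With the identity stand-in, the update reduces to $p_k = p_{k-1} + (1 - \beta) l_k$, so $\beta$ acts as a damping factor on the per-frame change $l_k$.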

Abstract

We introduce Diff-Tracker, a novel approach to the challenging task of unsupervised visual tracking that leverages a pre-trained text-to-image diffusion model. Our main idea is to exploit the rich knowledge encapsulated in the pre-trained diffusion model, such as its understanding of image semantics and structural information, to address unsupervised visual tracking. To this end, we design an initial prompt learner that enables the diffusion model to recognize the tracking target by learning a prompt that represents the target. Furthermore, to let the prompt adapt dynamically to the target's movements, we propose an online prompt updater. Extensive experiments on five benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.

Paper Structure (36 sections, 8 equations, 2 figures, 4 tables)

This paper contains 36 sections, 8 equations, 2 figures, 4 tables.

Introduction
Related Work
Unsupervised visual tracking.
Text-to-image diffusion models.
Preliminaries: Text-to-Image Diffusion Models
Training process.
Cross-attention layers.
Self-attention layers.
Diff-Tracker
Task Definition and Our Framework
Task definition.
Our framework.
Initial Prompt Learner
Attention harmonization.
Learning of initial prompt.
...and 21 more sections

Figures (2)

Figure 1: The framework of Diff-Tracker consists of the initial prompt learner (left) and the online prompt updater (right). The prompt produced by the online prompt updater is fed into the network of the initial prompt learner to obtain the output cross-attention map. This map is compared with the ground-truth (GT) cross-attention map to compute the loss used to update the online prompt updater.
Figure 2: The detailed architecture of the online prompt updater.
