DeTrack: In-model Latent Denoising Learning for Visual Object Tracking

Xinyu Zhou, Jinglun Li, Lingyi Hong, Kaixun Jiang, Pinxue Guo, Weifeng Ge, Wenqiang Zhang

TL;DR

This work reframes visual object tracking as an in-model latent denoising task inspired by diffusion models. It introduces DeTrack, which denoises noisy bounding boxes inside a denoising Vision Transformer (ViT) composed of multiple denoising blocks. By conditioning on visual memory and a search region, and by incorporating a trajectory memory and a compound memory, the model localizes the target robustly in a single forward pass, which enables real-time performance. Empirical results on AVisT, GOT-10k, LaSOT, and LaSOT$_{ext}$ show competitive accuracy and improved generalization to unseen data, supported by ablations on denoising steps and memory configurations. The approach offers a practical, diffusion-inspired paradigm for tracking that balances robustness to noise with efficiency under challenging conditions.

Abstract

Previous visual object tracking methods employ image-feature regression models or coordinate autoregression models for bounding box prediction. Image-feature regression methods depend heavily on matching results and do not use a positional prior, while autoregressive approaches can only be trained on the bounding boxes available in the training set, which may lead to suboptimal performance on unseen test data. Inspired by diffusion models, in which denoising learning improves robustness to unseen data, we add noise to bounding boxes to generate noisy boxes for training, thereby improving robustness on test data. We propose a new paradigm that formulates visual object tracking as a denoising learning process. However, tracking algorithms are usually required to run in real time, and directly applying a diffusion model to object tracking would severely impair tracking speed. We therefore decompose the denoising learning process across the denoising blocks within a single model, rather than running the model multiple times, and call the resulting paradigm in-model latent denoising learning. Specifically, we propose a denoising Vision Transformer (ViT) composed of multiple denoising blocks. Template and search embeddings are projected into every denoising block as conditions. Each denoising block is responsible for removing noise from a predicted bounding box, and the stacked denoising blocks cooperate to complete the whole denoising process. Subsequently, we use image features and trajectory information to refine the denoised bounding box. In addition, we use trajectory memory and visual memory to improve tracking stability. Experimental results validate the effectiveness of our approach, which achieves competitive performance on several challenging datasets.
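
The control flow described in the abstract can be illustrated with a toy sketch: a box is perturbed with noise, then passed once through a stack of blocks, each of which removes part of the noise. This is not the authors' model. The function names, the Gaussian noise, the `(cx, cy, w, h)` box format, and the fixed `rate` are all assumptions made for illustration, and `toy_denoising_block` is a hand-written stand-in for a learned block that would condition on template and search embeddings.

```python
import random


def add_box_noise(box, sigma, rng):
    """Perturb a (cx, cy, w, h) box with Gaussian noise; keep w and h positive."""
    cx, cy, w, h = (v + rng.gauss(0.0, sigma) for v in box)
    return (cx, cy, max(w, 1e-3), max(h, 1e-3))


def toy_denoising_block(box, condition, rate=0.5):
    """Hypothetical stand-in for one learned denoising block.

    A real block would infer the correction from template and search
    embeddings; here `condition` is simply the box to move toward.
    """
    return tuple(b + rate * (c - b) for b, c in zip(box, condition))


def in_model_denoise(noisy_box, condition, num_blocks=4):
    """Single forward pass: the box flows through `num_blocks` stacked blocks."""
    box = noisy_box
    steps = [box]
    for _ in range(num_blocks):
        box = toy_denoising_block(box, condition)
        steps.append(box)
    return box, steps


def l1_error(a, b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


rng = random.Random(0)
gt_box = (0.5, 0.5, 0.2, 0.3)  # assumed normalized coordinates
noisy_box = add_box_noise(gt_box, sigma=0.1, rng=rng)
denoised_box, steps = in_model_denoise(noisy_box, condition=gt_box, num_blocks=4)
errors = [l1_error(b, gt_box) for b in steps]
print(errors)  # the error shrinks after every block
```

The point of the sketch is the loop placement: the iteration over denoising steps sits inside one forward pass, whereas a conventional diffusion sampler would call the whole model once per step.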

Paper Structure (18 sections, 10 equations, 5 figures, 10 tables)


Figures (5)

  • Figure 1: Differences between denoising learning paradigms. (a) Diffusion model in the image generation task. (b) Diffusion model in the object detection task. (c) The proposed in-model latent denoising learning paradigm. The pink box indicates the denoising module. $\times N$ indicates that denoising is applied $N$ times.
  • Figure 2: Overview of the model architecture. (a) The model architecture comprises the input representation, the proposed Denoising ViT, and Box Refining and Mapping. It also includes Visual Memory and Trajectory Memory. (b) The proposed Denoising Block within the Denoising ViT.
  • Figure 3: Box refining and mapping, and the updating of visual memory. (a) Box refining and mapping introduces the trajectory memory to improve tracking performance. (b) Visual memory is updated based on a collaborative decision that combines $s_1$ (IoU score) and $s_2$ (Softmax score).
  • Figure 4: Visualization of the denoising steps on GOT-10k. The first row is the video GOT-10k-Test-000040, the second row is the video GOT-10k-Test-000003, and the third row is the video GOT-10k-Test-000051.
  • Figure 5: Ablation study of memory on LaSOT. (a) Different visual memory lengths; (b) different trajectory memory lengths; (c) different IoU thresholds applied for template updates; (d) the influence of Softmax thresholds; (e) with or without compound memory.
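
The visual memory update named in Figure 3(b) and ablated in Figure 5 can be sketched as a fixed-length buffer with a gated update. The captions only state that the decision combines $s_1$ (IoU score) and $s_2$ (Softmax score) and that memory length and both thresholds are tunable. Everything else here is an assumption: the class and method names are hypothetical, the two scores are combined with a logical AND, the default thresholds are placeholders, and the oldest entry is evicted when the buffer is full.

```python
from collections import deque


class VisualMemory:
    """Hypothetical fixed-length template memory with a two-score update gate."""

    def __init__(self, max_len=3, iou_thresh=0.5, softmax_thresh=0.5):
        # deque(maxlen=...) drops the oldest entry once the buffer is full.
        self.templates = deque(maxlen=max_len)
        self.iou_thresh = iou_thresh
        self.softmax_thresh = softmax_thresh

    def maybe_update(self, template, s1, s2):
        """Store `template` only if both scores pass their thresholds.

        s1: IoU score, s2: Softmax score (assumed to lie in [0, 1]).
        Returns True when the memory was updated.
        """
        if s1 >= self.iou_thresh and s2 >= self.softmax_thresh:
            self.templates.append(template)
            return True
        return False


memory = VisualMemory(max_len=2, iou_thresh=0.6, softmax_thresh=0.7)
results = [
    memory.maybe_update("frame_10", s1=0.9, s2=0.8),  # both pass
    memory.maybe_update("frame_20", s1=0.9, s2=0.3),  # Softmax score too low
    memory.maybe_update("frame_30", s1=0.4, s2=0.9),  # IoU score too low
    memory.maybe_update("frame_40", s1=0.7, s2=0.7),  # both pass
    memory.maybe_update("frame_50", s1=0.8, s2=0.9),  # both pass, evicts frame_10
]
print(results, list(memory.templates))
```

Requiring both scores to pass is the conservative choice: a template is stored only when the box is well localized and the classifier is confident, which limits drift from storing occluded or mislocalized frames.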