
DiffusionTrack: Diffusion Model For Multi-Object Tracking

Run Luo, Zikai Song, Lintao Ma, Jinlin Wei, Wei Yang, Min Yang

TL;DR

DiffusionTrack reframes multi-object tracking as a denoising diffusion process over paired bounding boxes across two frames, enabling joint detection and association within a single, consistent model. Built on a two-frame conditioned diffusion head with a spatial-temporal fusion module and a robust training/inference strategy, it decouples training from inference dynamics and supports dynamic adjustments in proposal counts and refinement steps. Empirical results on MOT17, MOT20, and DanceTrack show state-of-the-art performance among one-stage MOT trackers and competitive results overall, with strong robustness to detection perturbations. The work highlights diffusion models as a promising direction for MOT, offering a simple yet effective baseline with potential for further efficiency improvements and extension to diverse scenes.

Abstract

Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite the success of these approaches, they also suffer from common problems, such as harmful global or local inconsistency, poor trade-off between robustness and model complexity, and lack of flexibility in different scenes within the same video. In this paper, we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This novel progressive denoising diffusion strategy substantially augments the tracker's effectiveness, enabling it to discriminate between various objects. During the training stage, paired object boxes diffuse from paired ground-truth boxes to a random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. In inference, the model refines a set of paired randomly generated boxes to the detection and tracking results in a flexible one-step or multi-step denoising diffusion process. Extensive experiments on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves competitive performance compared to the current state-of-the-art methods.

Paper Structure (16 sections, 5 equations, 6 figures, 3 tables, 1 algorithm)


Figures (6)

  • Figure 1: DiffusionTrack formulates object association as a denoising diffusion process from paired noise boxes to paired object boxes within two adjacent frames $t-1$ and $t$. The diffusion head receives the two-frame image information extracted by the frozen backbone and then iteratively denoises the paired noise boxes to obtain the final paired object boxes.
  • Figure 2: The architecture of DiffusionTrack. Given the images and corresponding ground truth for frames $t$ and $t-1$, we extract features from the two adjacent frames with the frozen backbone; the diffusion head then takes paired noise boxes as input and predicts the category classification, box coordinates, and association score of the same object in the two adjacent frames. During training, the noise boxes are constructed by adding Gaussian noise to paired ground-truth boxes of the same object. In inference, the noise boxes are constructed by adding Gaussian noise to the padded prior object boxes in the previous frame.
  • Figure 3: The inference of DiffusionTrack can be divided into three steps: (1) padding repeated prior boxes with given noise boxes until the predefined number $N_{test}$ is reached; (2) adding Gaussian noise to the input boxes according to $\mathbf{B}=(1-\alpha_{t}) \cdot \mathbf{B}+\alpha_{t} \cdot \mathbf{B}_{noise}$ under the control of $\alpha_{t}$; (3) obtaining tracking results through a denoising process with $s$ DDIM sampling steps.
  • Figure 4: Intriguing properties of DiffusionTrack. DiffusionTrack obtains performance gain by enlarging proposal box numbers and sampling steps while being robust to detection perturbation compared with the previous tracker.
  • Figure 5: Visualization of the calculation process of 3D GIoU. 3D GIoU and 3D IoU are the volume-extended versions of the original area-based ones. The intersection $T_{d}\cap T_{gt}$ and $D_{T_{d},T_{gt}}$ of targets between two adjacent frames are square frustums, so their volumes can be calculated in the same way as for the original GIoU.
  • ...and 1 more figures
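
The first two inference steps described in the Figure 3 caption (padding prior boxes, then blending them with Gaussian noise) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pad_prior_boxes` and `blend_with_noise`, the `repeat` parameter, the normalized `(cx, cy, w, h)` box format, and the use of standard-normal noise boxes are all assumptions made for illustration. Step (3), the DDIM denoising loop, requires the trained diffusion head and is omitted.

```python
import random

def pad_prior_boxes(prior_boxes, n_test, repeat, rng):
    """Step (1): repeat each prior box, then pad with noise boxes up to n_test.

    `repeat` is a hypothetical parameter; the caption only says the prior
    boxes are repeated and padded until N_test boxes are available.
    """
    padded = [list(box) for box in prior_boxes for _ in range(repeat)][:n_test]
    while len(padded) < n_test:
        # Assumption: padding boxes are drawn from a standard Gaussian.
        padded.append([rng.gauss(0.0, 1.0) for _ in range(4)])
    return padded

def blend_with_noise(boxes, alpha_t, rng):
    """Step (2): B = (1 - alpha_t) * B + alpha_t * B_noise, applied per coordinate."""
    return [
        [(1.0 - alpha_t) * coord + alpha_t * rng.gauss(0.0, 1.0) for coord in box]
        for box in boxes
    ]

rng = random.Random(0)
# Two illustrative prior boxes from the previous frame, as (cx, cy, w, h).
prior = [[0.5, 0.5, 0.2, 0.4], [0.3, 0.6, 0.1, 0.3]]
padded = pad_prior_boxes(prior, n_test=8, repeat=2, rng=rng)
noisy = blend_with_noise(padded, alpha_t=0.25, rng=rng)
```

Under this formula, $\alpha_{t}=0$ leaves the prior boxes untouched and $\alpha_{t}=1$ replaces them with pure noise, so $\alpha_{t}$ sets how strongly the tracker relies on the previous frame's boxes.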