ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model

Lifan Jiang, Zhihui Wang, Siqi Yin, Guangxiao Ma, Peng Zhang, Boxi Wu

TL;DR

ConsistencyTrack reframes MOT as a generative denoising process guided by a Consistency Model, enabling fast joint detection and tracking across two frames. It employs a two-frame YOLOX backbone plus a diffusion head with Spatial-Temporal Fusion and an association-score head to perform one-step denoising for detection and matching. The method achieves strong results on MOT17 and DanceTrack, delivering competitive accuracy with significantly improved inference speed relative to DiffusionTrack, and supports flexible speed-accuracy trade-offs via adjustable sampling. This work demonstrates the practical viability of applying Consistency Model principles to MOT and suggests future work to boost precision and extend the approach to other tracking architectures.

Abstract

Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame. Existing MOT methods track multiple objects accurately and in real time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose ConsistencyTrack, a novel joint detection and tracking (JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model's noise resistance. During training, paired object boxes within two adjacent frames are diffused from ground-truth boxes to a random distribution, and the model learns to detect and track by reversing this process. At inference, the model refines randomly generated boxes into detection and tracking results in a minimal number of denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms the compared methods and, in particular, surpasses DiffusionTrack in inference speed and other performance metrics. Our code is available at https://github.com/Tankowa/ConsistencyTrack.
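
The noising and one-step denoising described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The 8-number paired-box layout, the helper names `perturb_pair` and `consistency_fn`, and the placeholder network are assumptions made for illustration. The skip parameterization and the constants `SIGMA_DATA` and `EPS` follow the original Consistency Model formulation (Song et al., 2023); whether ConsistencyTrack uses the same values is not stated here.

```python
import math
import random

SIGMA_DATA = 0.5  # data scale from Song et al. (2023); assumed, not from this paper
EPS = 0.002       # smallest noise level from Song et al. (2023); assumed


def perturb_pair(pair, sigma, rng):
    """Add N(0, sigma^2) noise to each coordinate of a paired box.

    `pair` holds 8 numbers: (cx, cy, w, h) in frame k - dk followed by
    (cx, cy, w, h) in frame k. This layout is hypothetical.
    """
    return [v + sigma * rng.gauss(0.0, 1.0) for v in pair]


def c_skip(t):
    """Weight of the noisy input; equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS) ** 2 + SIGMA_DATA**2)


def c_out(t):
    """Weight of the network output; equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / math.sqrt(SIGMA_DATA**2 + t**2)


def consistency_fn(network, noisy_pair, t):
    """One-step denoiser f(x, t) = c_skip(t) * x + c_out(t) * F(x, t)."""
    raw = network(noisy_pair, t)
    return [c_skip(t) * x + c_out(t) * r for x, r in zip(noisy_pair, raw)]


# Usage with a placeholder network that always predicts zeros.
rng = random.Random(0)
gt_pair = [0.50, 0.50, 0.20, 0.40, 0.52, 0.50, 0.20, 0.40]
noisy = perturb_pair(gt_pair, sigma=1.0, rng=rng)
denoised = consistency_fn(lambda x, t: [0.0] * len(x), noisy, t=1.0)
```

Because `c_skip(EPS) = 1` and `c_out(EPS) = 0`, the denoiser returns its input unchanged at the smallest noise level. This is the boundary condition that lets a single network evaluation map a noisy sample directly to a clean estimate.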

Paper Structure (18 sections, 12 equations, 14 figures, 7 tables)


Figures (14)

  • Figure 1: The denoising strategy of the Consistency Model applied to the MOT task. ConsistencyTrack formulates object association as a denoising diffusion process from paired noise boxes to paired object boxes within two adjacent frames $(k-\Delta k, k)$. Here, $f_{\theta}(\cdot,\cdot)$ represents a one-step denoising process.
  • Figure 2: The Consistency Model is trained to map any point on a trajectory of the probability flow (PF) ODE back to the origin of that trajectory (Song et al., 2023). As in Figure 1, $f_{\theta}(\cdot,\cdot)$ represents a one-step denoising process.
  • Figure 4: Training procedure of the proposed ConsistencyTrack. The backbone network extracts features from two adjacent frames $(k-\Delta k,k)$ of a video sequence. Random Gaussian noise is then added to the GT boxes according to the noise addition strategy of the Consistency Model. The noisy boxes and their corresponding features are processed by the RoI pooler and fed into the ConsistencyHead, which removes noise iteratively using three basic modules and yields the final detection results. Each basic module contains a self-attention mechanism (SA), a spatial-temporal fusion module (STF), and a correlation score head (HD). After post-processing, the objects in adjacent frames $(k-\Delta k,k)$ are associated one-to-one according to their matching scores.
  • Figure 5: Training loss of ConsistencyTrack
  • Figure 6: Visualization of how 3D GIoU is computed. The volumetric intersection and the minimal bounding volume of target representations across consecutive frames are modeled as square frustums.
  • ...and 9 more figures
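
The 3D GIoU idea in the Figure 6 caption can be sketched as follows. The exact formula used by the authors is not given in this listing, so the code below is only one plausible reading. It assumes each target is a pair of axis-aligned boxes `(x1, y1, x2, y2)`, one per frame, with positive area. It also assumes each volume is the frustum volume `h / 3 * (A1 + A2 + sqrt(A1 * A2))` with unit height for one frame step. All function names are hypothetical.

```python
import math


def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if degenerate."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    """Overlap rectangle of two boxes (degenerate when they do not overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def enclosure(a, b):
    """Smallest axis-aligned box containing both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def frustum_volume(area_prev, area_cur, height=1.0):
    """Volume of a frustum whose two bases have the given areas."""
    return height / 3.0 * (area_prev + area_cur + math.sqrt(area_prev * area_cur))


def giou_3d(track_a, track_b):
    """3D GIoU between two targets, each given as (box at k - dk, box at k)."""
    vol_a = frustum_volume(area(track_a[0]), area(track_a[1]))
    vol_b = frustum_volume(area(track_b[0]), area(track_b[1]))
    vol_i = frustum_volume(
        area(intersection(track_a[0], track_b[0])),
        area(intersection(track_a[1], track_b[1])),
    )
    vol_c = frustum_volume(
        area(enclosure(track_a[0], track_b[0])),
        area(enclosure(track_a[1], track_b[1])),
    )
    vol_u = vol_a + vol_b - vol_i  # union volume by inclusion-exclusion
    return vol_i / vol_u - (vol_c - vol_u) / vol_c


# Usage: a target compared with a slightly shifted copy of itself.
target = ((0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0))
shifted = ((0.5, 0.0, 2.5, 2.0), (1.5, 0.0, 3.5, 2.0))
score = giou_3d(target, shifted)
```

Under these assumptions, identical targets score 1, and targets that never overlap in either frame receive a negative score that decreases as the enclosing volume grows.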