ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model
Lifan Jiang, Zhihui Wang, Siqi Yin, Guangxiao Ma, Peng Zhang, Boxi Wu
TL;DR
ConsistencyTrack reframes MOT as a generative denoising process guided by a Consistency Model, enabling fast joint detection and tracking across two frames. It employs a two-frame YOLOX backbone plus a diffusion head with Spatial-Temporal Fusion and an association-score head to perform one-step denoising for detection and matching. The method achieves strong results on MOT17 and DanceTrack, delivering competitive accuracy with significantly improved inference speed relative to DiffusionTrack, and supports flexible speed-accuracy trade-offs via adjustable sampling. This work demonstrates the practical viability of applying Consistency Model principles to MOT and suggests future work to boost precision and extend the approach to other tracking architectures.
Abstract
Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID in every frame. Existing MOT methods track multiple objects accurately and in real time across various scenarios. However, these methods still face challenges such as poor noise resistance and frequent ID switches. In this research, we propose ConsistencyTrack, a novel joint detection and tracking (JDT) framework that formulates detection and association as a denoising diffusion process on perturbed bounding boxes. This progressive denoising strategy significantly improves the model's noise resistance. During training, paired object boxes in two adjacent frames are diffused from ground-truth boxes to a random distribution, and the model learns to detect and track by reversing this process. During inference, the model refines randomly generated boxes into detection and tracking results in a minimal number of denoising steps. ConsistencyTrack also introduces an innovative target association strategy to address target occlusion. Experiments on the MOT17 and DanceTrack datasets demonstrate that ConsistencyTrack outperforms the compared methods and, in particular, surpasses DiffusionTrack in inference speed and other performance metrics. Our code is available at https://github.com/Tankowa/ConsistencyTrack.
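The training-time perturbation and the inference-time starting point described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the function names `perturb_pairs` and `random_pairs`, the normalized `(cx, cy, w, h)` box format, the additive noise form `x + sigma * eps`, and the use of independent noise for the two boxes of a pair are all assumptions made for illustration. The denoising network itself is omitted.

```python
import random

def perturb_pairs(pairs, sigma, rng):
    """Add Gaussian noise of scale `sigma` to every coordinate of each box pair.

    `pairs` is a list of (box_prev, box_next) tuples; each box is an assumed
    normalized (cx, cy, w, h) tuple for the same object in two adjacent frames.
    """
    noisy = []
    for box_prev, box_next in pairs:
        # Assumption: the two boxes of a pair receive independent noise.
        noisy_prev = tuple(v + sigma * rng.gauss(0.0, 1.0) for v in box_prev)
        noisy_next = tuple(v + sigma * rng.gauss(0.0, 1.0) for v in box_next)
        noisy.append((noisy_prev, noisy_next))
    return noisy

def random_pairs(n, sigma, rng):
    """Inference-time start: `n` box pairs drawn purely from noise."""
    zero = (0.0, 0.0, 0.0, 0.0)
    return perturb_pairs([(zero, zero)] * n, sigma, rng)

# Training-style usage: one ground-truth object seen in two adjacent frames.
rng = random.Random(0)
gt_pairs = [((0.50, 0.50, 0.20, 0.40), (0.52, 0.50, 0.20, 0.40))]
noisy_pairs = perturb_pairs(gt_pairs, sigma=0.1, rng=rng)

# Inference-style usage: start from 5 random proposals.
proposals = random_pairs(5, sigma=1.0, rng=rng)
```

In this sketch, a trained model would map `noisy_pairs` (or `proposals`) back to clean paired boxes, ideally in a single step; how the actual method schedules `sigma` and pads the proposal set is not specified in the abstract.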
