DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction

Weiyi Lv, Yuhang Huang, Ning Zhang, Ruei-Sung Lin, Mei Han, Dan Zeng

TL;DR

DiffMOT is the first tracker to introduce a diffusion probabilistic model into MOT to tackle non-linear motion prediction, and it optimizes the diffusion process so that far fewer sampling steps are needed.

Abstract

In Multiple Object Tracking (MOT), objects often exhibit non-linear motion, with acceleration, deceleration, and irregular direction changes. Tracking-by-detection (TBD) trackers with Kalman Filter motion prediction work well in pedestrian-dominant scenarios but fall short in complex situations where multiple objects perform non-linear and diverse motion simultaneously. To tackle complex non-linear motion, we propose a real-time diffusion-based MOT approach named DiffMOT. Specifically, for the motion predictor component, we propose a novel Decoupled Diffusion-based Motion Predictor (D$^2$MP). It models the entire distribution of the various motions present in the data as a whole. It also predicts an individual object's motion conditioned on that object's historical motion information. Furthermore, it optimizes the diffusion process to use far fewer sampling steps. As an MOT tracker, DiffMOT runs in real time at 22.7 FPS and outperforms the state of the art on the DanceTrack and SportsMOT datasets, with $62.3\%$ and $76.2\%$ HOTA, respectively. To the best of our knowledge, DiffMOT is the first to introduce a diffusion probabilistic model into MOT to tackle non-linear motion prediction.

Paper Structure

This paper contains 25 sections, 16 equations, 10 figures, 10 tables, and 1 algorithm.

Figures (10)

  • Figure 1: (a) illustrates the trajectories of DiffMOT on sampled sequences of DanceTrack. Each object's center position over 200 frames is plotted in 3D coordinates. The objects in DanceTrack exhibit non-linear motion trajectories. Trackers with the KF predictor fail at frame 30 because of inaccurate prediction, while our DiffMOT with D$^2$MP tracks successfully. (b) shows the HOTA-IDF1-FPS comparison of different trackers. Our DiffMOT with the YOLOX-X detector achieves $62.3\%$ HOTA and $63.0\%$ IDF1 on the DanceTrack test set at $22.7$ FPS. (c) shows the motion prediction of the linear Kalman Filter on different datasets. The average IoU between the predicted and ground-truth bounding boxes is used as the metric to demonstrate the linear (high IoU) or non-linear (low IoU) character of each dataset.
  • Figure 2: The overall architecture of DiffMOT. DiffMOT consists of three parts: detection, motion prediction, and association.
  • Figure 3: The overall architecture of D$^2$MP. D$^2$MP consists of the forward process and the reversed process. In the forward process, the data-to-zero and zero-to-noise processes are enclosed within the blue dashed box. In the reversed process, HMINet is enclosed within the orange dashed box. $p_{\boldsymbol{\Theta}}$ refers to the operation introduced in Eq. (7) to reconstruct $\hat{\mathbf{M}}_{f, 0}$.
  • Figure 4: Qualitative comparison between using KF and D$^2$MP as the motion model on the DanceTrack test set. The upper row shows the results predicted by KF, while the lower row shows the results predicted by D$^2$MP. Red arrows indicate the noteworthy objects. Boxes of the same color represent the same ID. Best viewed in color and zoomed in.
  • Figure 5: The architecture of D$^2$MP-TB. The distinction from D$^2$MP-OB is enclosed within the red dashed box.
  • ...and 5 more figures
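
The metric described in Figure 1(c), the average IoU between linearly predicted and ground-truth boxes, can be illustrated with a minimal sketch. It is not the paper's code: a plain constant-velocity extrapolation stands in for the Kalman Filter, the two trajectories are synthetic, and all function names (`iou`, `constant_velocity_predict`, `mean_prediction_iou`, `make_track`) are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def constant_velocity_predict(prev: Box, curr: Box) -> Box:
    """Assumed stand-in for a linear predictor: repeat the last displacement."""
    return (
        2 * curr[0] - prev[0],
        2 * curr[1] - prev[1],
        2 * curr[2] - prev[2],
        2 * curr[3] - prev[3],
    )


def mean_prediction_iou(track: List[Box]) -> float:
    """Average IoU between extrapolated boxes and the true next boxes."""
    scores = [
        iou(constant_velocity_predict(track[t - 2], track[t - 1]), track[t])
        for t in range(2, len(track))
    ]
    return sum(scores) / len(scores)


def make_track(centers_x: List[float], size: float = 10.0) -> List[Box]:
    """Build fixed-size square boxes whose centers move along the x axis."""
    half = size / 2
    return [(x - half, -half, x + half, half) for x in centers_x]


# Synthetic trajectories (illustrative only, not data from the paper).
linear_track = make_track([3.0 * t for t in range(10)])         # constant speed
nonlinear_track = make_track([0.5 * t * t for t in range(10)])  # constant acceleration

linear_score = mean_prediction_iou(linear_track)
nonlinear_score = mean_prediction_iou(nonlinear_track)
print(f"linear: {linear_score:.3f}, non-linear: {nonlinear_score:.3f}")
# prints "linear: 1.000, non-linear: 0.818"
```

Under constant speed the extrapolated box coincides with the true box, so the mean IoU is 1. Under constant acceleration the prediction lags by one unit every frame and the mean IoU drops, which is the sense in which a low IoU signals non-linear motion.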