ByteTrack: Multi-Object Tracking by Associating Every Detection Box

Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, Xinggang Wang

TL;DR

ByteTrack addresses missed detections in MOT by leveraging almost all detections instead of discarding low-score boxes. It introduces BYTE, a two-stage data association that first matches high-score boxes to tracks and then re-associates with low-score boxes to recover occluded objects while filtering background, aided by Kalman-filter predictions and IoU/appearance cues. This simple yet effective approach yields state-of-the-art results on MOT17, MOT20, HiEve, and BDD100K, including 30 FPS performance with YOLOX, and generalizes across diverse trackers. The work demonstrates that maximizing the use of available detections can significantly improve identity preservation and tracking stability in challenging scenes.

Abstract

Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos. Most methods obtain identities by associating detection boxes whose scores are higher than a threshold. The objects with low detection scores, e.g. occluded objects, are simply thrown away, which brings non-negligible true object missing and fragmented trajectories. To solve this problem, we present a simple, effective and generic association method, tracking by associating almost every detection box instead of only the high score ones. For the low score detection boxes, we utilize their similarities with tracklets to recover true objects and filter out the background detections. When applied to 9 different state-of-the-art trackers, our method achieves consistent improvement on IDF1 score ranging from 1 to 10 points. To put forwards the state-of-the-art performance of MOT, we design a simple and strong tracker, named ByteTrack. For the first time, we achieve 80.3 MOTA, 77.3 IDF1 and 63.1 HOTA on the test set of MOT17 with 30 FPS running speed on a single V100 GPU. ByteTrack also achieves state-of-the-art performance on MOT20, HiEve and BDD100K tracking benchmarks. The source code, pre-trained models with deploy versions and tutorials of applying to other trackers are released at https://github.com/ifzhang/ByteTrack.
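The two-stage association described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Detection`, `iou`, `greedy_match`, and `byte_associate` are hypothetical, the threshold values are illustrative assumptions, and a simple greedy IoU matcher stands in for whatever assignment solver a real tracker would use. The predicted track boxes are assumed to come from a motion model such as a Kalman filter, which is not shown here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    box: Box
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes: List[Box], track_ids: List[int],
                 dets: List[Detection], min_iou: float):
    """Greedy IoU matching (an assumed stand-in for an optimal assignment).

    Returns (track_id -> detection index, unmatched track ids,
    unmatched detection indices).
    """
    pairs = sorted(
        ((iou(track_boxes[t], d.box), t, j)
         for t in track_ids for j, d in enumerate(dets)),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for value, t, j in pairs:
        if value < min_iou:
            break  # remaining pairs overlap too little to be matched
        if t in matches or j in used:
            continue
        matches[t] = j
        used.add(j)
    unmatched_tracks = [t for t in track_ids if t not in matches]
    unmatched_dets = [j for j in range(len(dets)) if j not in used]
    return matches, unmatched_tracks, unmatched_dets


def byte_associate(predicted: List[Box], dets: List[Detection],
                   high_thresh: float = 0.5,  # illustrative value
                   low_thresh: float = 0.1,   # illustrative value
                   min_iou: float = 0.3):     # illustrative value
    """Two-stage association: high-score boxes first, then low-score boxes."""
    high = [d for d in dets if d.score >= high_thresh]
    low = [d for d in dets if low_thresh <= d.score < high_thresh]
    track_ids = list(range(len(predicted)))

    # Stage 1: match all tracks against high-score detections.
    first, remaining, new_idx = greedy_match(predicted, track_ids, high, min_iou)
    # Stage 2: match still-unmatched tracks against low-score detections.
    second, lost, _ = greedy_match(predicted, remaining, low, min_iou)

    assigned = {t: high[j] for t, j in first.items()}
    assigned.update({t: low[j] for t, j in second.items()})
    # Unmatched high-score boxes may start new tracks; unmatched
    # low-score boxes are treated as background and dropped.
    new_tracks = [high[j] for j in new_idx]
    return assigned, lost, new_tracks
```

In this sketch, a low-score box only survives if it overlaps a track that the first stage left unmatched, which is how an occluded object can keep its identity while isolated low-score boxes are filtered out.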

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 13 tables, 1 algorithm.

Figures (5)

  • Figure 1: MOTA-IDF1-FPS comparisons of different trackers on the test set of MOT17. The horizontal axis is FPS (running speed), the vertical axis is MOTA, and the radius of each circle is IDF1. Our ByteTrack achieves 80.3 MOTA and 77.3 IDF1 on the MOT17 test set at 30 FPS, outperforming all previous trackers. Details are given in Table \ref{table_mot17}.
  • Figure 2: Examples of our method, which associates every detection box. (a) shows all the detection boxes with their scores. (b) shows the tracklets obtained by previous methods, which associate only detection boxes whose scores are higher than a threshold, i.e., 0.5. The same box color represents the same identity. (c) shows the tracklets obtained by our method. The dashed boxes represent the boxes predicted for the previous tracklets by the Kalman filter. The two low-score detection boxes are correctly matched to the previous tracklets because of their large IoU with the predicted boxes.
  • Figure 3: Comparison of the performances of BYTE and SORT under different detection score thresholds. The results are from the validation set of MOT17.
  • Figure 4: Comparison of the number of TPs and FPs in all low score detection boxes and the low score tracked boxes obtained by BYTE. The results are from the validation set of MOT17.
  • Figure 5: Visualization results of ByteTrack. We select 6 sequences from the validation set of MOT17 and show the effectiveness of ByteTrack in handling difficult cases such as occlusion and motion blur. The yellow triangle represents the high-score box and the red triangle represents the low-score box. The same box color represents the same identity.