MOT20: A benchmark for multi object tracking in crowded scenes

Patrick Dendorfer, Hamid Rezatofighi, Anton Milan, Javen Shi, Daniel Cremers, Ian Reid, Stefan Roth, Konrad Schindler, Laura Leal-Taixé

TL;DR

MOT20 extends the MOTChallenge benchmark with 8 densely crowded sequences across three scenes to stress-test multi-object tracking in challenging, realistic crowds. It establishes a standardized evaluation framework using CLEAR and track-quality metrics, with public Faster R-CNN detections and a consistent annotation/data-format protocol to separate target pedestrians from distractors. The paper details annotation rules, dataset characteristics, and the evaluation pipeline, including a Hungarian-based tracker-to-target assignment and IoU distance threshold, to enable fair cross-method comparisons. By emphasizing generalization to unseen scenes and crowded scenarios, MOT20 aims to push the development of more robust, crowd-capable tracking systems with practical impact in surveillance and analytics.

Abstract

Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal of establishing a standardized evaluation of multiple object tracking methods. The challenge focuses on multiple people tracking, since pedestrians are well studied in the tracking community, and precise tracking and detection have high practical relevance. Since the first release, MOT15, MOT16, and MOT17 have contributed tremendously to the community by introducing a clean dataset and a precise framework to benchmark multi-object trackers. In this paper, we present our MOT20 benchmark, consisting of 8 new sequences depicting very crowded, challenging scenes. The benchmark was first presented at the 4th BMTT MOT Challenge Workshop at the Computer Vision and Pattern Recognition Conference (CVPR) 2019, and gives the chance to evaluate state-of-the-art methods for multiple object tracking when handling extremely crowded scenarios.

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: An overview of the MOT20 dataset. The dataset consists of 8 different sequences from 3 different scenes. The test dataset has two known and one unknown scene. Top: training sequences; bottom: test sequences.
  • Figure 2: Data provided for the challenges. Left: image of each frame of the sequences; middle: ground truth labels including all classes (only provided for the training set); right: public detections from a trained Faster R-CNN.
  • Figure 3: Four cases illustrating tracker-to-target assignments. (a) An ID switch occurs when the mapping switches from the previously assigned red track to the blue one. (b) A track fragmentation is counted in frame 3 because the target is tracked in frames 1-2, is then interrupted, and reacquires its 'tracked' status at a later point. A new (blue) track hypothesis also causes an ID switch at this point. (c) Although the tracking result is reasonably good, an optimal single-frame assignment in frame 1 is propagated through the sequence, causing 5 missed targets (FN) and 4 false positives (FP). Note that no fragmentations are counted in frames 3 and 6 because tracking of those targets is not resumed at a later point. (d) A degenerate case illustrating that target re-identification is not handled correctly. An interrupted ground truth trajectory will cause a fragmentation. Note the less intuitive ID switch, which is counted because blue is the closest target in frame 5 that is not in conflict with the mapping in frame 4.
  • Figure 4: The annotations include different classes. The target class is pedestrians (left). Besides pedestrians, the data contain special classes such as static person and non-motorized vehicles (non mot vhcl). However, these classes are filtered out during evaluation and do not affect the test score. In addition, we annotate occluders and crowds.
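
The ID-switch and fragmentation rules illustrated in Figure 3 can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's evaluation code: it assumes the per-frame tracker-to-target matching has already been computed, and it handles a single ground-truth target. The function name `count_idsw_and_frag` and the input encoding (a track ID per frame, or `None` when the target is missed) are hypothetical choices made for this example.

```python
from typing import Optional, Sequence, Tuple


def count_idsw_and_frag(assignments: Sequence[Optional[int]]) -> Tuple[int, int]:
    """Count ID switches and fragmentations for ONE ground-truth target.

    assignments[t] is the ID of the track hypothesis matched to the target
    in frame t, or None if the target is missed in that frame.
    (Hypothetical encoding, chosen for this sketch only.)
    """
    id_switches = 0
    fragmentations = 0
    last_id: Optional[int] = None  # last track ID matched to this target
    interrupted = False            # tracked earlier, currently missed

    for track_id in assignments:
        if track_id is None:
            # A gap only matters if the target was tracked before it.
            if last_id is not None:
                interrupted = True
            continue
        # ID switch: matched again, but to a different track than last time.
        if last_id is not None and track_id != last_id:
            id_switches += 1
        # Fragmentation: counted only once tracking actually resumes.
        if interrupted:
            fragmentations += 1
            interrupted = False
        last_id = track_id

    return id_switches, fragmentations


# Mapping switches from track 1 to track 2 without a gap, as in case (a).
case_a = count_idsw_and_frag([1, 1, 1, 2, 2, 2])
# Tracked, interrupted, then resumed by a new track, as in case (b).
case_b = count_idsw_and_frag([1, 1, None, None, 2, 2])
# Tracking is lost and never resumed, so no fragmentation is counted.
case_lost = count_idsw_and_frag([1, 1, None, None, None, None])
print(case_a, case_b, case_lost)
```

Deferring the fragmentation count until the target is matched again mirrors the note in Figure 3(c): a gap that is never closed adds missed targets but no fragmentation.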