Towards Generalizable Multi-Object Tracking

Zheng Qin, Le Wang, Sanping Zhou, Panpan Fu, Gang Hua, Wei Tang

TL;DR

The paper tackles the generalization gap in MOT by identifying key scenario attributes that influence tracker performance and proposing GeneralTrack, a point-wise to instance-wise relation framework. GeneralTrack uses a multi-scale point-region relation and hierarchical aggregation to avoid manually balancing motion and appearance across diverse scenes, enabling robust cross-scenario tracking. It achieves state-of-the-art results on multiple benchmarks (notably 1st on BDD100K with 57.87 mTETA) and demonstrates strong domain generalization without dataset-specific tuning. The work combines thorough attribute analysis, end-to-end relational modeling, and comprehensive ablations to validate the approach and points toward future work on multi-frame relations.

Abstract

Multi-Object Tracking (MOT) encompasses various tracking scenarios, each characterized by unique traits. Effective trackers should demonstrate a high degree of generalizability across diverse scenarios. However, existing trackers struggle to accommodate all aspects or necessitate hypothesis and experimentation to customize the association information (motion and/or appearance) for a given scenario, leading to narrowly tailored solutions with limited generalizability. In this paper, we investigate the factors that influence trackers' generalization to different scenarios and concretize them into a set of tracking scenario attributes to guide the design of more generalizable trackers. Furthermore, we propose a point-wise to instance-wise relation framework for MOT, i.e., GeneralTrack, which can generalize across diverse scenarios while eliminating the need to balance motion and appearance. Thanks to its superior generalizability, our proposed GeneralTrack achieves state-of-the-art performance on multiple benchmarks and demonstrates the potential for domain generalization. Code: https://github.com/qinzheng2000/GeneralTrack.git

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: Adverse cases for some attributes in tracking scenarios. The line in 'irregular motion' is the trajectory of each target.
  • Figure 2: Tracking scenario attribute maps. Appearance performs poorly in scenarios with a large percentage in the blue area, and motion performs poorly in scenarios with a large percentage in the white area.
  • Figure 3: Overview of our GeneralTrack. The Feature Relation Extractor obtains global dense relations with frame $t$ for each point in frame $t-1$ via a 4D correlation volume. Then, by constructing a correlation pyramid, we transform the global relations into Multi-scale Point-region Relations and form a relation map for frame $t-1$. Finally, we progressively perform Relational Aggregation to aggregate point-wise relations into instance-wise relations and achieve the association between tracklets and detections.
  • Figure 4: Multi-scale Point-region Relation on the correlation pyramid. With downsampling, the search region becomes progressively larger (red, green and blue boxes). Two example points $a$ and $b$ are given in the figure, where the green dot and blue dot represent the target point in frame $t-1$ and frame $t$, respectively. (a) Headlights of the car. The car moves very fast with a large displacement, so its relation point in frame $t$ is not captured until the layer with the largest scale. (b) Head of the man. Because people move relatively little, the relation can be obtained at the first level of the pyramid (the highest resolution). Such a relation search paradigm adapts flexibly to both large and small displacements at low computational cost.
  • Figure 5: Visualization of tracking results comparison. Note that we use exactly the same detection results as for GHOST. Boxes of different colors represent bounding boxes with different identities. The red bus shown in bold is the target of our comparison. In GHOST it suffered tracklet interruptions, ID switches, and misclassification, whereas these were resolved in our approach.
  • ...and 1 more figure
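
The captions of Figures 3 and 4 describe a 4D correlation volume and a correlation pyramid whose search region grows with downsampling. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The function names, the dot-product similarity, the 2x average pooling over the frame $t$ dimensions, and the fixed lookup radius are all assumptions made for illustration.

```python
import numpy as np


def correlation_volume(feat_prev, feat_curr):
    """All-pairs dot-product similarity between two (H, W, C) feature maps.

    Returns an array of shape (H, W, H, W): entry [i, j, k, l] relates point
    (i, j) in frame t-1 to point (k, l) in frame t. (Assumed similarity.)
    """
    channels = feat_prev.shape[-1]
    corr = np.einsum("ijc,klc->ijkl", feat_prev, feat_curr)
    return corr / np.sqrt(channels)


def correlation_pyramid(corr, num_levels=3):
    """Average-pool the frame-t dimensions by 2 per level (assumed pooling)."""
    pyramid = [corr]
    for _ in range(num_levels - 1):
        h, w, hk, wk = pyramid[-1].shape
        hk2, wk2 = hk // 2, wk // 2
        cropped = pyramid[-1][:, :, : hk2 * 2, : wk2 * 2]
        pooled = cropped.reshape(h, w, hk2, 2, wk2, 2).mean(axis=(3, 5))
        pyramid.append(pooled)
    return pyramid


def point_region_relation(pyramid, y, x, radius=1):
    """Gather a (2r+1)x(2r+1) window per level around point (y, x).

    The window size is fixed, so each coarser level covers a region of
    frame t that is twice as wide as the previous one.
    """
    side = 2 * radius + 1
    features = []
    for level, corr in enumerate(pyramid):
        plane = corr[y, x]  # relations of this point to all of frame t
        cy = min(y // 2**level, plane.shape[0] - 1)
        cx = min(x // 2**level, plane.shape[1] - 1)
        padded = np.pad(plane, radius)  # zero padding at the borders
        window = padded[cy : cy + side, cx : cx + side]
        features.append(window.ravel())
    return np.concatenate(features)


# Toy example: every pixel gets a unique one-hot feature, and frame t is
# frame t-1 shifted 3 pixels to the right (a "large" displacement).
H = W = 8
feat_prev = np.eye(H * W).reshape(H, W, H * W)
feat_curr = np.roll(feat_prev, shift=3, axis=1)

corr = correlation_volume(feat_prev, feat_curr)
pyramid = correlation_pyramid(corr, num_levels=3)
relation = point_region_relation(pyramid, y=4, x=2, radius=1)

level0, level1, level2 = relation[:9], relation[9:18], relation[18:]
print("level 0 sees the match:", bool(level0.max() > 0))
print("level 1 sees the match:", bool(level1.max() > 0))
```

In this toy setup the point at (4, 2) moves to (4, 5), which lies outside the 3x3 window at full resolution but inside the window at the next pyramid level. This mirrors the fast-moving car in Figure 4(a), while a small displacement would already be found at level 0 as in Figure 4(b).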