Multi-Object Tracking by Hierarchical Visual Representations

Jinkun Cao; Jiangmiao Pang; Kris Kitani

TL;DR

This work addresses discriminative appearance modeling for multi-object tracking by moving beyond bounding-box semantics to a three-level visual hierarchy consisting of compositional, semantic, and contextual cues. It introduces CSC-Attention to fuse these cues into CSC-tokens and a transformer-based tracker, CSC-Tracker, that performs global association over a horizon $H$ and uses a final association matrix $\mathbf{M}^t \in \mathbb{R}^{(M_t+1) \times N_t}$. Training optimizes an association objective $L_{\text{asso}}$ plus a feature-distance term and detection loss, and inference uses online sliding windows with Hungarian matching. Experiments on MOT17, MOT20, and DanceTrack show state-of-the-art results among transformer-based MOT methods, improved robustness to detection noise, and favorable time efficiency, establishing a strong new baseline.
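The association step mentioned above can be illustrated with a small sketch. The summary states only that Hungarian matching is applied to a matrix of shape $(M_t+1) \times N_t$. The code below assumes that the extra row scores "no existing trajectory" (so any number of detections may fall into it) and that higher scores mean better matches. The function names `hungarian_min_cost` and `associate` are hypothetical and are not taken from the paper or its code.

```python
def hungarian_min_cost(cost):
    """Hungarian algorithm (O(n^2 m)) for a rectangular cost matrix.

    cost has n rows and m columns with n <= m. Returns a list giving,
    for each row, the column assigned to it at minimum total cost.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials
    v = [0.0] * (m + 1)      # column potentials
    p = [0] * (m + 1)        # p[j]: row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the matching along the augmenting path just found.
        while True:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def associate(scores):
    """Match detections to trajectories from an (M+1) x N score matrix.

    ASSUMPTION: the last row scores "no existing trajectory"; it is
    replicated N times so several detections can select it at once.
    Returns, per detection, a trajectory index or None.
    """
    num_tracks = len(scores) - 1
    num_dets = len(scores[0])
    if num_dets == 0:
        return []
    cost = []
    for d in range(num_dets):
        row = [-scores[k][d] for k in range(num_tracks)]  # maximize score
        row += [-scores[num_tracks][d]] * num_dets        # replicated extra row
        cost.append(row)
    cols = hungarian_min_cost(cost)
    return [c if c < num_tracks else None for c in cols]


# Two existing trajectories (M_t = 2) and three detections (N_t = 3).
example_scores = [
    [0.9, 0.1, 0.2],  # trajectory 0
    [0.2, 0.8, 0.1],  # trajectory 1
    [0.3, 0.3, 0.3],  # assumed "no existing trajectory" row
]
print(associate(example_scores))
```

Each detection receives either the index of one existing trajectory or `None`, and no trajectory is assigned more than one detection. Replicating the extra row is one common way to fit an "unmatched" option into a one-to-one assignment solver; the tracker itself may treat unmatched detections differently.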

Abstract

We propose a new hierarchical visual representation paradigm for multi-object tracking. Objects are discriminated more effectively by attending to their compositional visual regions and contrasting them with background contextual information than by relying only on a semantic visual cue such as the bounding box. This compositional-semantic-contextual hierarchy can be flexibly integrated into different appearance-based multi-object tracking methods. We also propose an attention-based visual feature module to fuse the hierarchical visual representations. The proposed method achieves state-of-the-art accuracy and time efficiency among query-based methods on multiple multi-object tracking benchmarks.

Paper Structure (13 sections, 5 equations, 3 figures, 10 tables)

Introduction
Related Works
Method
Overall Architecture
CSC-Attention
Training and Inference
Experiments
Experiment Setups
Benchmark Results
Ablation Study
Robustness to Detection Noise
Time Efficiency
Conclusion

Figures (3)

Figure 1: With a close look at distinct compositional visual regions, we can recognize certain individuals much more easily.
Figure 2: The architecture of CSC-Tracker. The left half illustrates the overall architecture. The right half is a zoomed-in view of the CSC-Attention module. Our contributions are (1) the visual hierarchy for feature extraction and (2) the CSC-Attention module for feature fusion.
Figure 3: Top row: results on the DanceTrack test set, where targets exhibit occlusion, crossover, and similar appearance. Bottom row: results on a MOT20 test video, where pedestrians are crowded and heavily occluded.
