Multi-Object Tracking by Hierarchical Visual Representations
Jinkun Cao, Jiangmiao Pang, Kris Kitani
TL;DR
This work addresses discriminative appearance modeling for multi-object tracking by moving beyond bounding-box semantics to a three-level visual hierarchy consisting of compositional, semantic, and contextual cues. It introduces CSC-Attention to fuse these cues into CSC-tokens and a transformer-based tracker, CSC-Tracker, that performs global association over a horizon $H$ and uses a final association matrix $\mathbf{M}^t \in \mathbb{R}^{(M_t+1) \times N_t}$. Training optimizes an association objective $L_{\text{asso}}$ plus a feature-distance term and detection loss, and inference uses online sliding windows with Hungarian matching. Experiments on MOT17, MOT20, and DanceTrack show state-of-the-art results among transformer-based MOT methods, improved robustness to detection noise, and favorable time efficiency, establishing a strong new baseline.
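The association step at inference can be illustrated with a small sketch. It assumes, beyond what is stated above, that the extra row of $\mathbf{M}^t$ stands for "no existing trajectory" (a newly appearing object), that higher scores mean stronger association, and that each existing trajectory may receive at most one detection. The function name `match_detections` is hypothetical. For brevity, an exhaustive search stands in for the Hungarian algorithm; it returns an optimal assignment on tiny inputs, whereas a practical tracker would use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import product


def match_detections(assoc):
    """Assign each detection (column) to a trajectory (row) or to 'new track'.

    assoc: list of M+1 rows, each holding N association scores.
    Assumption: the LAST row is a placeholder meaning "start a new track"
    and may be chosen by any number of detections; real trajectory rows
    may be chosen at most once.
    Returns a list of length N with a trajectory index or None per detection.
    """
    num_rows = len(assoc)
    num_dets = len(assoc[0])
    new_row = num_rows - 1

    best_score, best_choice = float("-inf"), None
    # Exhaustive search over all row choices; only viable for tiny matrices.
    for choice in product(range(num_rows), repeat=num_dets):
        real_rows = [r for r in choice if r != new_row]
        if len(real_rows) != len(set(real_rows)):
            continue  # a real trajectory was matched to two detections
        score = sum(assoc[r][d] for d, r in enumerate(choice))
        if score > best_score:
            best_score, best_choice = score, choice

    return [None if r == new_row else r for r in best_choice]


# Two existing trajectories (rows 0-1), one "new track" row, three detections.
scores = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.3, 0.3, 0.6],
]
print(match_detections(scores))  # [0, 1, None]
```

In this toy example the third detection matches neither existing trajectory well, so it falls to the placeholder row and would start a new track.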
Abstract
We propose a new hierarchical visual representation paradigm for multi-object tracking. Objects are discriminated more effectively by attending to their compositional visual regions and contrasting them with background contextual information than by relying only on a semantic visual cue such as the bounding box. This compositional-semantic-contextual hierarchy can be flexibly integrated into different appearance-based multi-object tracking methods. We also propose an attention-based visual feature module to fuse the hierarchical visual representations. The proposed method achieves state-of-the-art accuracy and time efficiency among query-based methods on multiple multi-object tracking benchmarks.
