Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking
Mingzhan Yang, Guangxin Han, Bin Yan, Wenhua Zhang, Jinqing Qi, Huchuan Lu, Dong Wang
TL;DR
Hybrid-SORT addresses multi-object tracking under occlusion and clustering by complementing the usual strong cues (spatial and appearance information) with weak cues: confidence state, height state, and velocity direction. It introduces Tracklet Confidence Modeling (TCM) and Height Modulated IoU (HMIoU), and strengthens the motion cue with Robust Observation-Centric Momentum (ROCM), while remaining online and real-time. The method is plug-and-play and training-free, and it generalizes across multiple trackers and benchmarks, with gains on DanceTrack, MOT17, and MOT20 that grow further when an appearance model is added (Hybrid-SORT-ReID). The results show that weak cues are a practical, efficient way to make association more robust in occluded and clustered scenes.
Abstract
Multi-Object Tracking (MOT) aims to detect and associate all desired objects across frames. Most methods accomplish the task by explicitly or implicitly leveraging strong cues (i.e., spatial and appearance information), which exhibit powerful instance-level discrimination. However, when object occlusion and clustering occur, spatial and appearance information become ambiguous simultaneously due to the high overlap among objects. In this paper, we demonstrate that this long-standing challenge in MOT can be efficiently and effectively resolved by incorporating weak cues to compensate for strong cues. Along with velocity direction, we introduce the confidence state and height state as potential weak cues. While achieving superior performance, our method retains Simple, Online and Real-Time (SORT) characteristics. Our method also generalizes well to diverse trackers and scenarios in a plug-and-play and training-free manner: significant and consistent improvements are observed when it is applied to 5 different representative trackers. Further, with both strong and weak cues, our method Hybrid-SORT achieves superior performance on diverse benchmarks, including MOT17, MOT20, and especially DanceTrack, where interaction and severe occlusion frequently occur together with complex motion. The code and models are available at https://github.com/ymzis69/HybridSORT.
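
The abstract names height as a weak cue but gives no formula at this point. As an illustration only, the sketch below assumes that a height cue can be folded into association by scaling the ordinary IoU by the overlap ratio of the two boxes' vertical extents. The function names and the `(x1, y1, x2, y2)` box layout are hypothetical and are not taken from the released code.

```python
def iou(a, b):
    """Standard IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def height_iou(a, b):
    """Overlap ratio of the vertical extents only (assumed height cue)."""
    overlap = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    span = max(a[3], b[3]) - min(a[1], b[1])
    return overlap / span if span > 0 else 0.0


def height_modulated_iou(a, b):
    """Assumed combination: the height overlap ratio scales the plain IoU."""
    return height_iou(a, b) * iou(a, b)


if __name__ == "__main__":
    track = (0.0, 0.0, 10.0, 20.0)      # tall box
    detection = (0.0, 0.0, 10.0, 10.0)  # same width, half the height
    print(iou(track, detection))
    print(height_modulated_iou(track, detection))
```

Under this assumption, a detection whose height disagrees with the track is penalized more strongly than plain IoU would penalize it, which is the kind of extra discrimination a weak cue is meant to supply when boxes overlap heavily.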
