StableTrack: Stabilizing Multi-Object Tracking on Low-Frequency Detections

Matvei Shelukhan, Timur Mamedov, Karina Kvanchiani

TL;DR

StableTrack tackles the challenge of multi-object tracking under low-frequency detections by decoupling detection from association and introducing a robust two-stage cross-frame matching that leverages a Bbox-Based Distance (BBD) and intermediate-frame visual tracking to refine Kalman Filter predictions. The method integrates Forward VT and Backward VT to predict positions in an intermediate frame, extends the KF state, and employs a two-stage Hungarian-based association that first relies on BBD and appearance, then on IoU with stricter spatial constraints. Key contributions include the BBD formulation, the two-stage matching framework, and the integration of visual tracking to stabilize predictions, which together yield an $11.6\%$ HOTA improvement at $1$ Hz on MOT17-val and strong performance on MOT17, MOT20, and DanceTrack under full-frequency detections. These results demonstrate improved resilience to temporal gaps with practical implications for real-time MOT in resource-constrained environments.

Abstract

Multi-object tracking (MOT) is one of the most challenging tasks in computer vision, where it is important to correctly detect objects and associate these detections across frames. Current approaches mainly focus on tracking objects in each frame of a video stream, making it almost impossible to run the model under conditions of limited computing resources. To address this issue, we propose StableTrack, a novel approach that stabilizes the quality of tracking on low-frequency detections. Our method introduces a new two-stage matching strategy to improve the cross-frame association between low-frequency detections. We propose a novel Bbox-Based Distance instead of the conventional Mahalanobis distance, which allows us to effectively match objects using the Re-ID model. Furthermore, we integrate visual tracking into the Kalman Filter and the overall tracking pipeline. Our method outperforms current state-of-the-art trackers in the case of low-frequency detections, achieving $11.6\%$ HOTA improvement at $1$ Hz on MOT17-val, while keeping up with the best approaches on the standard MOT17, MOT20, and DanceTrack benchmarks with full-frequency detections.
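
The two-stage association summarized in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation: the cost matrices are assumed to be precomputed (stage one from BBD combined with appearance, stage two as 1 - IoU), the gate values `gate1` and `gate2` are hypothetical, and the exhaustive search in `assign` is only a stand-in for a Hungarian solver on tiny inputs.

```python
from itertools import permutations


def assign(cost, rows, cols, max_cost):
    """Minimum-cost one-to-one assignment between `rows` and `cols`.

    Exhaustive search over permutations: a stand-in for a Hungarian solver,
    usable only for tiny examples. Pairs costlier than `max_cost` are dropped.
    """
    if not rows or not cols:
        return []
    if len(rows) <= len(cols):
        candidates = (list(zip(rows, p)) for p in permutations(cols, len(rows)))
    else:
        candidates = (list(zip(p, cols)) for p in permutations(rows, len(cols)))
    best = min(candidates, key=lambda pairs: sum(cost[r][c] for r, c in pairs))
    return [(r, c) for r, c in best if cost[r][c] <= max_cost]


def two_stage_match(stage1_cost, stage2_cost, gate1, gate2):
    """Match tracks (rows) to detections (columns) in two passes.

    Stage 1 uses the first cost matrix; stage 2 re-matches only the tracks
    and detections that stage 1 left unmatched, using the second matrix.
    """
    tracks = list(range(len(stage1_cost)))
    dets = list(range(len(stage1_cost[0]))) if stage1_cost else []

    first = assign(stage1_cost, tracks, dets, gate1)
    matched_tracks = {t for t, _ in first}
    matched_dets = {d for _, d in first}

    left_tracks = [t for t in tracks if t not in matched_tracks]
    left_dets = [d for d in dets if d not in matched_dets]
    second = assign(stage2_cost, left_tracks, left_dets, gate2)
    return first, second


# Illustrative costs (made up): track 1 is too uncertain for stage 1
# but overlaps well with detection 1, so stage 2 recovers it.
stage1_cost = [[0.1, 0.9, 0.8],
               [0.9, 0.7, 0.9],
               [0.8, 0.9, 0.2]]
stage2_cost = [[0.2, 0.9, 0.9],
               [0.9, 0.3, 0.9],
               [0.9, 0.9, 0.1]]
first, second = two_stage_match(stage1_cost, stage2_cost, gate1=0.5, gate2=0.4)
print(first, second)
```

In a real tracker the exhaustive search would typically be replaced by a Hungarian solver such as `scipy.optimize.linear_sum_assignment`, which scales to realistic numbers of tracks and detections.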

Paper Structure

This paper contains 31 sections, 11 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of HOTA scores on the MOT17 validation set for our StableTrack and other state-of-the-art methods at different detection frame rates (FPS). StableTrack achieves the highest and most stable results for low-frequency detections, while keeping up with the other methods in the high-frequency detection scenario.
  • Figure 2: Scheme of StableTrack. The core stages include visual tracking (Forward and Backward VT) for motion modeling, Bbox-Based Distance (BBD) for similarity measurement, and a two-stage matching strategy for cross-frame association.
  • Figure 3: The fixed positional displacement $d$ between Kalman Filter prediction and detected bounding box results in different IoU values for smaller objects (left) and for larger ones (right).
  • Figure 4: Qualitative comparison of our StableTrack with the baseline, the current state-of-the-art method TrackTrack. In the results from the baseline tracker, (a) IDs $6$ and $13$ are switched because of high IoU, (b) ID $6$ changes to $35$ after an occlusion, (c) IDs $8$ and $9$ change after a long occlusion, and (d) IDs $9$ and $30$ change after camera motion. In contrast, our StableTrack shows correct tracking results in every case, demonstrating its robustness.
  • Figure 5: Examples of ASMS predictions in challenging scenarios.
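
The effect described in the Figure 3 caption can be checked numerically. The sketch below assumes square boxes and a purely horizontal shift; the side lengths and the displacement $d = 10$ are illustrative values, not taken from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(side, d):
    """IoU between a square box and the same box shifted right by d pixels."""
    box = (0.0, 0.0, side, side)
    shifted = (d, 0.0, side + d, side)
    return iou(box, shifted)


d = 10.0                          # same displacement for both objects
small = shifted_iou(20.0, d)      # small object: 20 x 20 pixels
large = shifted_iou(100.0, d)     # large object: 100 x 100 pixels
print(f"small: {small:.3f}, large: {large:.3f}")
```

For a square of side $s$ shifted by $d < s$, the IoU reduces to $(s - d)/(s + d)$, so the same displacement costs the small box far more overlap (about $0.33$) than the large one (about $0.82$). A fixed IoU threshold therefore treats objects of different sizes unevenly.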