MambaMOT: State-Space Model as Motion Predictor for Multi-Object Tracking

Hsiang-Wei Huang, Cheng-Yen Yang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

TL;DR

This work addresses the inadequacy of Kalman-filter-based motion models for multi-object tracking in nonlinear, occlusion-rich scenarios by introducing MambaMOT, an online MOT approach built on the efficient state-space model Mamba. MambaMOT predicts each tracklet's next location with a Mamba-based motion block and a dedicated prediction head. MambaMOT+ extends it by extracting trajectory embeddings that enable tracklet merging at reduced computational cost. On DanceTrack and SportsMOT, MambaMOT and especially MambaMOT+ achieve substantial gains in HOTA and IDF1 while maintaining real-time speed (~28.8 FPS) on a single RTX 4080, demonstrating practical viability in complex motion regimes. The results suggest that learning-based motion modeling with trajectory-aware merging can surpass Kalman-filter-based approaches in robustness and efficiency for MOT in dynamic environments.

Abstract

In multi-object tracking (MOT), traditional methods often rely on the Kalman filter for motion prediction, leveraging its strengths in linear motion scenarios. However, the inherent limitations of these methods become evident when they are confronted with the complex, nonlinear motions and occlusions prevalent in dynamic environments such as sports and dance. This paper explores replacing the Kalman filter with a learning-based motion model that improves tracking accuracy and adaptability beyond the constraints of Kalman-filter-based trackers. Our proposed methods, MambaMOT and MambaMOT+, demonstrate advanced performance on challenging MOT datasets such as DanceTrack and SportsMOT, showcasing their ability to handle intricate, nonlinear motion patterns and frequent occlusions more effectively than traditional methods.
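
To make the linear-motion limitation concrete, the following is a minimal sketch, not the paper's method and not a full Kalman filter. It uses plain constant-velocity extrapolation, the kind of linear motion assumption commonly paired with Kalman-filter trackers. The function names, the circular test trajectory, and its parameters (`radius`, `omega`) are hypothetical and chosen only for illustration.

```python
import math


def constant_velocity_predict(track):
    """Extrapolate the next (x, y) center from the last two observations.

    Assumption: the target keeps the displacement it had between the two
    most recent frames (a constant-velocity model, no noise handling).
    """
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def circle_point(t, radius=100.0, omega=0.3):
    """Hypothetical nonlinear motion: a target moving along a circle."""
    return (radius * math.cos(omega * t), radius * math.sin(omega * t))


# Linear motion: the extrapolation matches the true next position.
line = [(float(t), 2.0 * t) for t in range(5)]
line_error = math.dist(constant_velocity_predict(line), (5.0, 10.0))

# Circular motion: the same extrapolation overshoots the true position.
circle = [circle_point(t) for t in range(5)]
circle_error = math.dist(constant_velocity_predict(circle), circle_point(5))

print(f"linear motion error:   {line_error:.3f}")
print(f"circular motion error: {circle_error:.3f}")
```

In this toy setting the prediction error is zero on the straight line and nonzero on the circle, which is the kind of gap a learned motion model is meant to close.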

Paper Structure (18 sections, 6 equations, 3 figures, 2 tables)

This paper contains 18 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A comparison of tracking results from MambaMOT and the Kalman filter. The left panel compares the bounding boxes predicted by our proposed MambaMOT and the state-of-the-art method ByteTrack: the blue and green boxes denote the predicted bounding boxes of the target, and the red box is the ground truth. The right panel compares the IoU between the predicted and ground-truth bounding boxes; MambaMOT consistently predicts locations more accurately than the Kalman filter. Best viewed zoomed in and in color.
  • Figure 2: (Left) The MambaMOT$^{+}$ architecture processes a sequence of bounding boxes from the same track through a linear projection layer for motion modeling. The model generates predictions and embeddings, updating the hidden state $h_T$ at each time frame. These predictions are used for detecting and matching tracks, while trajectory embeddings aid in merging tracklets. The detailed structure of the Mamba block is integral to this framework.
  • Figure 3: Randomly sampled trajectory visualizations comparing MambaMOT (blue) and ByteTrack (green) with the ground-truth trajectory (red) on the DanceTrack dataset. MambaMOT demonstrates better tracking accuracy in most cases.
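
Figure 1 scores predictions by their IoU (intersection over union) with the ground-truth box. The sketch below shows how that metric is typically computed. The `(x1, y1, x2, y2)` corner format and the function name `box_iou` are assumptions made for illustration; this page does not state the box format the paper uses.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Assumption: each box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    Returns a value in [0, 1]; 0.0 when the boxes do not overlap.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (5.0, 5.0, 15.0, 15.0)
print(box_iou(prediction, ground_truth))
```

A prediction that coincides with the ground truth scores 1.0, and the score falls toward 0.0 as the predicted box drifts away, so a higher curve in the right panel of Figure 1 means more accurate location prediction.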