S3MOT: Monocular 3D Object Tracking with Selective State Space Model

Zhuohao Yan, Shaoquan Feng, Xingxing Li, Yuxuan Zhou, Chunxi Xia, Shengyu Li

TL;DR

Monocular 3D MOT is challenged by weak cue fusion and non-differentiable associations. The authors propose S3MOT, a selective state-space framework that fuses appearance (FCOE), velocity-aware motion (VeloSSM), and differentiable data association (HSSM) for robust, real-time tracking. On KITTI, S3MOT achieves 76.86 HOTA and 77.41 AssA at 31 FPS, significantly surpassing prior monocular methods. This work delivers a scalable, end-to-end monocular 3D MOT approach with strong multi-cue integration and practical deployment potential.

Abstract

Accurate and reliable multi-object tracking (MOT) in 3D space is essential for advancing robotics and computer vision applications. However, it remains a significant challenge in monocular setups due to the difficulty of mining 3D spatiotemporal associations from 2D video streams. In this work, we present three innovative techniques to enhance the fusion and exploitation of heterogeneous cues for monocular 3D MOT: (1) we introduce the Hungarian State Space Model (HSSM), a novel data association mechanism that compresses contextual tracking cues across multiple paths, enabling efficient and comprehensive assignment decisions with linear complexity. HSSM features a global receptive field and dynamic weights, in contrast to traditional linear assignment algorithms that rely on hand-crafted association costs. (2) We propose Fully Convolutional One-stage Embedding (FCOE), which eliminates ROI pooling by directly using dense feature maps for contrastive learning, thus improving object re-identification accuracy under challenging conditions such as varying viewpoints and lighting. (3) We enhance 6-DoF pose estimation through VeloSSM, an encoder-decoder architecture that models temporal dependencies in velocity to capture motion dynamics, overcoming the limitations of frame-based 3D inference. Experiments on the KITTI public test benchmark demonstrate the effectiveness of our method, achieving a new state-of-the-art performance of 76.86 HOTA at 31 FPS. Our approach outperforms the previous best by significant margins of +2.63 HOTA and +3.62 AssA, showcasing its robustness and efficiency for monocular 3D MOT tasks. The code and models are available at https://github.com/bytepioneerX/s3mot.
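
For context on what HSSM is contrasted with, the following is a minimal sketch of the traditional pipeline the abstract mentions: a hand-crafted association cost followed by a one-to-one linear assignment. It is not the authors' code. The cost formula, the weight `w_motion`, and both function names are assumptions made for illustration, and an exhaustive search over permutations stands in for the Hungarian algorithm so that the example needs only the standard library.

```python
from itertools import permutations
from math import dist
from typing import List, Sequence, Tuple


def handcrafted_cost(
    track_center: Sequence[float],
    det_center: Sequence[float],
    track_emb: Sequence[float],
    det_emb: Sequence[float],
    w_motion: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """Hypothetical cost: weighted sum of 3D center distance and cosine distance."""
    dot = sum(a * b for a, b in zip(track_emb, det_emb))
    norm = (sum(a * a for a in track_emb) ** 0.5) * (sum(b * b for b in det_emb) ** 0.5)
    cosine_dist = 1.0 - dot / norm
    return w_motion * dist(track_center, det_center) + (1.0 - w_motion) * cosine_dist


def brute_force_assignment(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost one-to-one matching for a small square cost matrix.

    Exhaustive search (factorial time); a stand-in for the Hungarian algorithm,
    which solves the same problem in polynomial time.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


# Toy example: two tracks, two detections listed in swapped order.
track_centers = [(0.0, 0.0, 10.0), (5.0, 0.0, 20.0)]
det_centers = [(5.2, 0.0, 20.1), (0.1, 0.0, 10.2)]
track_embs = [(1.0, 0.0), (0.0, 1.0)]
det_embs = [(0.0, 1.0), (1.0, 0.0)]

cost_matrix = [
    [handcrafted_cost(tc, dc, te, de) for dc, de in zip(det_centers, det_embs)]
    for tc, te in zip(track_centers, track_embs)
]
matches = brute_force_assignment(cost_matrix)
```

Each entry of `cost_matrix` here is computed from one track-detection pair in isolation, which is the context-free behavior the abstract says HSSM is designed to avoid.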

Paper Structure

This paper contains 17 sections, 13 equations, 6 figures, 4 tables, 3 algorithms.

Figures (6)

  • Figure 1: (a) contrasts our context-aware tracking cues with traditional context-free methods. Traditional approaches minimize hand-crafted or linearly learned association costs, disregarding interactions between matching pairs. In contrast, our method leverages a four-way scanning mechanism to capture rich contextual knowledge, facilitating efficient information exchange through a global receptive field and dynamic weighting. (b) presents qualitative results of S3MOT in challenging scenes. Despite occlusion and fast motion disrupting some tracking cues, our method leverages contextual information for more robust data fusion. Red arrows highlight objects with ambiguous tracking cues. (c) illustrates the HOTA-AssA comparisons of different trackers. Our monocular 3D tracker S3MOT achieves a new state-of-the-art performance with 76.86 HOTA and 77.41 AssA on the KITTI test benchmark.
  • Figure 2: Architecture details of S3MOT. The current and historical frame images pass through DD3D to estimate the instance category, 2D bounding box, 3D bounding box, and center-ness. FCOE then extracts Re-ID features from the deep feature map. VeloSSM-P encodes the tracklet flow to predict the current frame's states. HSSM fuses heterogeneous tracking cues to compute a soft association matrix. Finally, VeloSSM-U balances observation and prediction using tracklet flow and confidence, producing the refined tracklet states.
  • Figure 3: Illustration of feature-dense similarity learning. Dense features, comprising high center-ness features (open circles) and low center-ness features (filled circles), are leveraged to construct a discriminative feature space. The feature-dense instance similarity loss operates by comparing dense feature pairs between the keyframe and reference frame, promoting the separation of feature embeddings belonging to different objects while simultaneously pulling embeddings of the same object closer.
  • Figure 4: Overview of SS2D in our Hungarian State Space Model (HSSM). The input is processed along four separate scan paths, each handled by an independent Mamba block. The outputs from these blocks are then combined via a bidirectional merging process to generate the final 2D feature map.
  • Figure 5: Examples of 3D MOT results on the KITTI tracking benchmark. The first four rows show tracked objects projected onto the image plane, with distinct object IDs displayed above each 3D bounding box. In the final row, detected object states for the current frame are represented by bounding boxes in the LiDAR coordinate system, with historical trajectories illustrated using ellipse symbols.
  • ...and 1 more figure
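
The four-way scanning described in the Figure 1 and Figure 4 captions can be illustrated with a short sketch. It shows one plausible reading: a 2D grid (for example, a track-by-detection feature map) is flattened along four scan orders, each sequence is handled by its own sequence model, and the results are mapped back to grid positions and combined. The specific scan orders (row-major, column-major, and their reverses), the summation used for merging, and all function names are assumptions. Plain Python callables stand in for the Mamba blocks.

```python
from typing import Callable, List

Grid = List[List[float]]
SeqModel = Callable[[List[float]], List[float]]


def four_way_scan(grid: Grid) -> List[List[float]]:
    """Flatten an H x W grid into four 1D sequences (assumed scan orders)."""
    h, w = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    col_major = [grid[i][j] for j in range(w) for i in range(h)]
    # Forward and reversed traversals of both orders give four paths.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def merge_scans(seqs: List[List[float]], h: int, w: int) -> Grid:
    """Undo each scan order and sum the four outputs per grid cell.

    Summation is an assumption; the caption only says the outputs are merged.
    """
    row_f, col_f, row_b, col_b = seqs
    row_b = row_b[::-1]  # restore forward row-major order
    col_b = col_b[::-1]  # restore forward column-major order
    return [
        [
            row_f[i * w + j] + row_b[i * w + j] + col_f[j * h + i] + col_b[j * h + i]
            for j in range(w)
        ]
        for i in range(h)
    ]


def ss2d_sketch(grid: Grid, models: List[SeqModel]) -> Grid:
    """Apply one independent sequence model per scan path, then merge."""
    h, w = len(grid), len(grid[0])
    processed = [model(seq) for model, seq in zip(models, four_way_scan(grid))]
    return merge_scans(processed, h, w)


def running_mean(seq: List[float]) -> List[float]:
    """Toy causal sequence model: each output depends only on earlier inputs."""
    out, total = [], 0.0
    for k, x in enumerate(seq, start=1):
        total += x
        out.append(total / k)
    return out


grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mixed = ss2d_sketch(grid, [running_mean] * 4)
```

Because each toy model is causal along its own path, a cell in `mixed` receives information from cells before it in all four orders, which together cover the whole grid. This is the sense in which scanning along several paths can give every position a global receptive field.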