MambaTrack3D: A State Space Model Framework for LiDAR-Based Object Tracking under High Temporal Variation

Shengjing Tian, Yinan Han, Xiantong Zhao, Xuehu Liu, Qi Lang

TL;DR

This work targets 3D single object tracking in LiDAR point clouds under high temporal variation (HTV), where existing memory-based trackers suffer from quadratic computational complexity and temporal redundancy. It introduces MambaTrack3D, which combines a Mamba-based Inter-frame Propagation (MIP) module for near-linear, geometry-aware feature propagation with a Grouped Feature Enhancement Module (GFEM) that separates foreground and background semantics at the channel level to reduce redundancy in the memory bank. The approach outperforms prior trackers on KITTI-HTV and nuScenes-HTV, with gains of up to 6.5 success and 9.5 precision over HVTrack under moderate temporal gaps, while remaining competitive on the standard KITTI dataset, and it runs at real-time speeds thanks to the near-linear state-space modeling. The results demonstrate a favorable accuracy–efficiency trade-off and robust generalization to conventional tracking, making it suitable for real-world autonomous perception systems.

Abstract

Dynamic outdoor environments with high temporal variation (HTV) pose significant challenges for 3D single object tracking in LiDAR point clouds. Existing memory-based trackers often suffer from quadratic computational complexity, temporal redundancy, and insufficient exploitation of geometric priors. To address these issues, we propose MambaTrack3D, a novel HTV-oriented tracking framework built upon the state space model Mamba. Specifically, we design a Mamba-based Inter-frame Propagation (MIP) module that replaces conventional single-frame feature extraction with efficient inter-frame propagation, achieving near-linear complexity while explicitly modeling spatial relations across historical frames. Furthermore, a Grouped Feature Enhancement Module (GFEM) is introduced to separate foreground and background semantics at the channel level, thereby mitigating temporal redundancy in the memory bank. Extensive experiments on KITTI-HTV and nuScenes-HTV benchmarks demonstrate that MambaTrack3D consistently outperforms both HTV-oriented and normal-scenario trackers, achieving improvements of up to 6.5 success and 9.5 precision over HVTrack under moderate temporal gaps. On the standard KITTI dataset, MambaTrack3D remains highly competitive with state-of-the-art normal-scenario trackers, confirming its strong generalization ability. Overall, MambaTrack3D achieves a superior accuracy-efficiency trade-off, delivering robust performance across both specialized HTV and conventional tracking scenarios.
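The complexity contrast drawn in the abstract can be illustrated with a minimal sketch. This is not the paper's MIP module and not Mamba itself: it is a plain, non-selective scalar state-space recurrence, and the function names (`diagonal_ssm_scan`, `pairwise_interactions`, `scan_steps`) and the parameters `a`, `b`, `c` are hypothetical, chosen only for illustration. It shows why a recurrent scan visits each input once, whereas full pairwise attention scores every pair of inputs.

```python
from typing import List, Sequence


def diagonal_ssm_scan(x: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Run the recurrence h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    Illustrative assumption: a, b, c are fixed scalars. Mamba instead makes
    its parameters depend on the input, which this sketch does not model.
    The loop touches each element once, so cost grows linearly with len(x).
    """
    h = 0.0
    ys: List[float] = []
    for x_t in x:
        h = a * h + b * x_t   # update the hidden state from the previous state
        ys.append(c * h)      # read out the current state
    return ys


def pairwise_interactions(n: int) -> int:
    """Query-key pairs scored by full self-attention over n tokens."""
    return n * n


def scan_steps(n: int) -> int:
    """Recurrence updates performed by a sequential scan over n tokens."""
    return n


if __name__ == "__main__":
    # Impulse response: the state decays by a factor of a at every step.
    print(diagonal_ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))  # [2.0, 1.0, 0.5]
    for n in (128, 256, 512):
        print(n, pairwise_interactions(n), scan_steps(n))
```

Doubling the number of inputs doubles the scan's step count but quadruples the number of pairwise interactions, which is the kind of gap that becomes significant as a memory bank of historical frames grows.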

Paper Structure

This paper contains 18 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Performance comparison between a transformer-based method and our method. The X-axis is the number of input points. The FLOPs on the left Y-axis reflect computational complexity, while the FPS on the right Y-axis represents the running speed of the different methods.
  • Figure 2: The overall pipeline of the proposed method.
  • Figure 3: The pipeline of the proposed MIP module.
  • Figure 4: The pipeline of the proposed GFEM module.
  • Figure 5: Visualization of tracking results on KITTI. We plot five tracklets with different temporal intervals, i.e., interval = 1, 2, 3, 5, 10. MambaTrack3D is highlighted in orange, and the ground truth in green. Best viewed zoomed in.
  • ...and 2 more figures