MambaST: A Plug-and-Play Cross-Spectral Spatial-Temporal Fuser for Efficient Pedestrian Detection
Xiangbo Gao, Asiegbu Miracle Kanu-Asiegbu, Xiaoxiao Du
TL;DR
MambaST targets robust pedestrian detection for autonomous driving by fusing RGB and thermal data across time with a Mamba-based spatial-temporal framework. Its core contributions are the Multi-head Hierarchical Patching and Aggregation (MHHPA) module for multi-scale cross-spectral fusion, an order-aware concatenation scheme, and a recurrent temporal fusion mechanism that enables efficient inference while preserving fine-grained detail. On the KAIST dataset, MambaST achieves state-of-the-art or competitive performance, especially for small pedestrians and in night conditions, with significantly fewer parameters and lower GFLOPs than transformer-based baselines. The result is a practical, plug-and-play approach to cross-spectral pedestrian detection in latency-sensitive settings and a pathway for deploying multi-modal temporal fusion in autonomous systems.
Abstract
This paper proposes MambaST, a plug-and-play cross-spectral spatial-temporal fusion pipeline for efficient pedestrian detection. Several challenges exist for pedestrian detection in autonomous driving applications. First, it is difficult to perform accurate detection using RGB cameras under dark or low-light conditions. Cross-spectral systems must be developed to integrate complementary information from multiple sensor modalities, such as thermal and visible cameras, to improve the robustness of the detections. Second, pedestrian detection models are latency-sensitive. Efficient and easy-to-scale detection models with fewer parameters are highly desirable for real-time applications such as autonomous driving. Third, pedestrian video data provides spatial-temporal correlations of pedestrian movement. It is beneficial to incorporate temporal as well as spatial information to enhance pedestrian detection. This work leverages recent advances in the state space model (Mamba) and proposes a novel Multi-head Hierarchical Patching and Aggregation (MHHPA) structure to extract both fine-grained and coarse-grained information from both RGB and thermal imagery. Experimental results show that the proposed MHHPA is an effective and efficient alternative to a Transformer model for cross-spectral pedestrian detection. Our proposed model also achieves superior performance on small-scale pedestrian detection. The code is available at https://github.com/XiangboGaoBarry/MambaST.
