MambaST: A Plug-and-Play Cross-Spectral Spatial-Temporal Fuser for Efficient Pedestrian Detection

Xiangbo Gao; Asiegbu Miracle Kanu-Asiegbu; Xiaoxiao Du

TL;DR

MambaST tackles robust pedestrian detection in autonomous driving by fusing RGB and thermal data across time using a Mamba-based spatial-temporal framework. The core contributions are the MHHPA module for multi-scale cross-spectral fusion, an order-aware concatenation scheme, and a recurrent temporal fusion mechanism that enables efficient inference while preserving fine-grained detail. Empirical results on KAIST show MambaST achieving state-of-the-art or competitive performance, especially for small pedestrians and in night conditions, with significantly fewer parameters and lower GFLOPs than transformer-based baselines. This work provides a practical, plug-and-play approach that enhances cross-spectral pedestrian detection in real-time settings and offers a pathway for deploying multi-modal temporal fusion in autonomous systems.

Abstract

This paper proposes MambaST, a plug-and-play cross-spectral spatial-temporal fusion pipeline for efficient pedestrian detection. Several challenges exist for pedestrian detection in autonomous driving applications. First, it is difficult to perform accurate detection using RGB cameras under dark or low-light conditions. Cross-spectral systems must be developed to integrate complementary information from multiple sensor modalities, such as thermal and visible cameras, to improve the robustness of the detections. Second, pedestrian detection models are latency-sensitive. Efficient and easy-to-scale detection models with fewer parameters are highly desirable for real-time applications such as autonomous driving. Third, pedestrian video data provides spatial-temporal correlations of pedestrian movement. It is beneficial to incorporate temporal as well as spatial information to enhance pedestrian detection. This work leverages recent advances in the state space model (Mamba) and proposes a novel Multi-head Hierarchical Patching and Aggregation (MHHPA) structure to extract both fine-grained and coarse-grained information from both RGB and thermal imagery. Experimental results show that the proposed MHHPA is an effective and efficient alternative to a Transformer model for cross-spectral pedestrian detection. Our proposed model also achieves superior performance on small-scale pedestrian detection. The code is available at https://github.com/XiangboGaoBarry/MambaST.
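To make the state space idea in the abstract concrete, the following is a minimal sketch of a discrete linear state space recurrence, h_t = a * h_{t-1} + b * x_t and y_t = c · h_t, with a diagonal state matrix. It is not the authors' implementation: Mamba additionally makes its parameters input-dependent, and the function name `ssm_scan` and all parameter values here are hypothetical. The sketch only illustrates why such a model runs in time linear in sequence length and can carry a hidden state from one chunk of input to the next.

```python
from typing import List, Optional, Sequence, Tuple


def ssm_scan(
    xs: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    h0: Optional[Sequence[float]] = None,
) -> Tuple[List[float], List[float]]:
    """Run a diagonal linear state space recurrence over a scalar sequence.

    Hypothetical illustration only: a, b, c are fixed per-state coefficients,
    whereas Mamba computes input-dependent (selective) parameters.
    Returns the outputs and the final hidden state.
    """
    h = [0.0] * len(a) if h0 is None else list(h0)
    ys: List[float] = []
    for x_t in xs:
        # State update: each state decays by a_i and accumulates b_i * x_t.
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        # Readout: weighted sum of the current state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys, h


# Processing a sequence in two chunks while passing the state along
# gives the same outputs as processing it in one pass.
a, b, c = [0.5, 0.5], [1.0, 1.0], [1.0, 1.0]
full, _ = ssm_scan([1.0, 1.0, 1.0], a, b, c)
first, state = ssm_scan([1.0, 1.0], a, b, c)
second, _ = ssm_scan([1.0], a, b, c, h0=state)
```

Because the state summarizes everything seen so far, each new input costs a constant amount of work, in contrast to attention, whose cost grows with the number of tokens already processed.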

Paper Structure (21 sections, 4 equations, 3 figures, 4 tables, 2 algorithms)

INTRODUCTION
RELATED WORK
Preliminary on Mamba and Vision Mamba
Cross-Modality Fusion Methods
Temporal Fusion for Video Understanding
METHODOLOGY
Overview of MambaST Model Architecture
Input Embedding
Multi-head Hierarchical Patching and Aggregation
Order-aware Concatenation and Flattening
Recurrent Structure for Temporal Fusion
EXPERIMENTAL RESULTS
Dataset and Evaluation Metric
Implementation Details
Comparison with Other Cross-Modal Fusion Methods
...and 6 more sections

Figures (3)

Figure 1: Visualization of the RGB and thermal object detection network. $D$ denotes the multiplication factor for channel size.
Figure 2: The proposed MambaST pipeline. The input RGB and thermal embeddings are passed through a novel Multi-head Hierarchical Patching and Aggregation (MHHPA) module to extract hierarchical features. An order-aware concatenation and flattening (OCF) procedure is used to concatenate and flatten the patched features. The MHHPA module is applied recurrently to allow for multi-temporal fusion.
Figure 3: Visual examples of detection results on the KAIST dataset. All bounding boxes are filtered by confidence $\geq$ 0.5.
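
The patching and flattening steps named in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that hierarchical patching can be approximated by average pooling at several patch sizes, and that "order-aware" means the RGB and thermal tokens from the same spatial location are placed next to each other in the flattened sequence. The helper names (`avg_pool`, `interleave_flatten`, `hierarchical_sequences`) and the patch sizes are hypothetical, and single-channel grids are used to keep the example short.

```python
from typing import Dict, List, Sequence

Grid = List[List[float]]


def avg_pool(grid: Grid, p: int) -> Grid:
    """Average non-overlapping p x p cells of a 2D grid (assumed patching step)."""
    h, w = len(grid), len(grid[0])
    assert h % p == 0 and w % p == 0, "grid size must be divisible by patch size"
    return [
        [
            sum(grid[i + di][j + dj] for di in range(p) for dj in range(p)) / (p * p)
            for j in range(0, w, p)
        ]
        for i in range(0, h, p)
    ]


def interleave_flatten(rgb: Grid, thermal: Grid) -> List[float]:
    """Flatten two grids row by row, alternating RGB and thermal per location.

    Assumption: this adjacency is what keeps the sequence "order-aware".
    """
    tokens: List[float] = []
    for row_r, row_t in zip(rgb, thermal):
        for r, t in zip(row_r, row_t):
            tokens.extend([r, t])
    return tokens


def hierarchical_sequences(
    rgb: Grid, thermal: Grid, patch_sizes: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[float]]:
    """Build one interleaved token sequence per patch size (fine to coarse)."""
    return {
        p: interleave_flatten(avg_pool(rgb, p), avg_pool(thermal, p))
        for p in patch_sizes
    }


# Toy 4 x 4 inputs: thermal values are offset by 100 so they are easy to spot.
rgb = [[float(4 * i + j) for j in range(4)] for i in range(4)]
thermal = [[100.0 + v for v in row] for row in rgb]
seqs = hierarchical_sequences(rgb, thermal)
```

Smaller patch sizes yield longer sequences that keep fine detail, while larger ones yield short sequences that summarize coarse context. Each sequence could then be passed to a sequence model such as the state space model the paper builds on.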
