MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection

Venkatraman Narayanan; Bala Sai; Rahul Ahuja; Pratik Likhar; Varun Ravi Kumar; Senthil Yogamani

TL;DR

This work tackles robust multimodal 3D object detection for autonomous driving by fusing camera and LiDAR features in BEV space. It introduces MambaFusion, which combines a hybrid LiDAR encoder that interleaves Mamba state-space blocks with windowed attention, a Multi-Modal Token Alignment module for calibration correction, an uncertainty-aware adaptive fusion gate, and a structure-conditioned diffusion head with graph-based refinement, all trained end-to-end and reinforced by temporal self-distillation. It achieves state-of-the-art nuScenes performance with linear-time complexity, specifically $O(TN)$ in temporal BEV token processing, and shows robustness to calibration noise, sensor sparsity, and detection range. The results demonstrate that combining selective state-space models with reliability-driven fusion and physics-informed refinement yields robust, scalable, and interpretable 3D perception for real-world autonomous driving.
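The $O(TN)$ claim can be made concrete with a toy recurrence. The sketch below is not the paper's Mamba block: it uses a fixed scalar decay rather than input-dependent (selective) parameters, and it assumes that $T$ is the number of frames and $N$ the number of BEV tokens per frame, which this summary does not define. All function names are hypothetical. It only illustrates why a scan visits each token once, whereas full self-attention over the same flattened sequence would compare every pair of tokens.

```python
def diagonal_ssm_scan(tokens, decay, in_gain):
    """Run h_t = decay * h_{t-1} + in_gain * x_t over a flat token sequence.

    Simplified stand-in for a state-space layer: one scalar state and
    fixed coefficients (a real selective SSM predicts them per token).
    """
    state = 0.0
    outputs = []
    for x in tokens:  # one update per token, so cost grows linearly
        state = decay * state + in_gain * x
        outputs.append(state)
    return outputs


def count_scan_steps(num_frames, tokens_per_frame):
    """Number of state updates for T frames of N tokens each: T * N."""
    return num_frames * tokens_per_frame


def count_attention_pairs(num_frames, tokens_per_frame):
    """Token pairs scored by full self-attention over the same sequence: (T * N) ** 2."""
    total = num_frames * tokens_per_frame
    return total * total


print(diagonal_ssm_scan([1.0, 0.0, 0.0], decay=0.5, in_gain=1.0))  # [1.0, 0.5, 0.25]
print(count_scan_steps(4, 100), count_attention_pairs(4, 100))     # 400 160000
```

With the assumed 4 frames of 100 tokens, the scan performs 400 updates while pairwise attention would score 160,000 pairs, which is the gap the linear-time claim refers to.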

Abstract

Reliable 3D object detection is fundamental to autonomous driving, and fusing camera and LiDAR data for this task remains a persistent challenge. Cameras provide dense visual cues but ill-posed depth; LiDAR provides precise 3D structure but sparse coverage. Existing BEV-based fusion frameworks have made good progress, but they still struggle with inefficient context modeling, spatially invariant fusion, and reasoning under uncertainty. We introduce MambaFusion, a unified multi-modal detection framework that achieves efficient, adaptive, and physically grounded 3D perception. MambaFusion interleaves selective state-space models (SSMs) with windowed transformers to propagate global context in linear time while preserving local geometric fidelity. A multi-modal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera-LiDAR features based on spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and calibrated confidence. MambaFusion establishes new state-of-the-art performance on nuScenes benchmarks while operating with linear-time complexity. The framework demonstrates that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving systems.
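The idea of re-weighting camera and LiDAR features by spatial confidence can be illustrated with a minimal gate for a single BEV cell. This is a sketch under stated assumptions, not the paper's fusion module: the abstract does not give the gate's form, so the sigmoid of a confidence difference, the scalar gate, and the names `gated_fusion`, `camera_conf`, and `lidar_conf` are all hypothetical. In a learned model the confidences would be predicted per location by the network.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(camera_feat, lidar_feat, camera_conf, lidar_conf):
    """Blend two feature vectors for one BEV cell with a reliability gate.

    Assumed form: gate = sigmoid(camera_conf - lidar_conf), so the output
    is a convex combination that leans toward the more confident sensor.
    """
    gate = sigmoid(camera_conf - lidar_conf)
    return [gate * c + (1.0 - gate) * l for c, l in zip(camera_feat, lidar_feat)]


# Equal confidence gives a plain average of the two modalities.
print(gated_fusion([2.0, 4.0], [0.0, 0.0], camera_conf=0.0, lidar_conf=0.0))  # [1.0, 2.0]

# A much higher LiDAR confidence makes the output follow the LiDAR feature.
print(gated_fusion([1.0], [3.0], camera_conf=-50.0, lidar_conf=50.0))
```

Because the gate is computed per cell, the blend can differ across the BEV grid, which is what distinguishes this kind of fusion from the spatially invariant fusion the abstract criticizes.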

Paper Structure (49 sections, 11 equations, 7 figures, 11 tables)

Introduction
Related Work
Camera-based 3D Object Detection
LiDAR-based 3D Object Detection
Efficient Sequence Models for Vision
Windowed Attention Mechanisms
Multi-Modal Fusion for 3D Detection
Uncertainty and Calibration in Perception
Spatial Reasoning and Diffusion-based Refinement
Methodology
Multi-Scale Hybrid LiDAR Encoding
Spatiotemporal Camera BEV Aggregation
Multi-Modal Token Alignment (MTA)
Adaptive Fusion with Uncertainty Modeling
Spatial reliability gating.
...and 34 more sections

Figures (7)

Figure 1: Architecture Overview. MambaFusion features: (1) multi-frame camera and LiDAR encoding with spatiotemporal Transformers and hybrid Mamba SSM/Transformer blocks, (2) Multi-Modal Token Alignment (MTA) for robust cross-modal spatial calibration, (3) uncertainty-aware adaptive fusion via bidirectional attention and spatial reliability gating, (4) dual-stream proposal generation and geometry-aware graph reasoning, (5) structure-conditioned diffusion for confidence refinement, and (6) temporal self-distillation for prediction stability. All components are jointly optimized for robust multi-modal 3D object detection.
Figure 2: Calibration robustness analysis. Performance degradation under increasing rotational perturbation to camera extrinsics. The Multi-Modal Token Alignment (MTA) module provides increasing protection under larger calibration errors, reducing degradation by 52% at 2.0° perturbation (6.5 NDS improvement). At realistic 0.5° drift levels observed in field deployments [levinson2013automatic], MTA maintains 98.1% of clean performance versus 96.7% without alignment. The divergence between the curves demonstrates that MTA's benefits scale with the severity of the perturbation.
Figure 3: Robustness to complete sensor dropout. Under complete camera failure, the system retains 87.9% NDS and 85.2% mAP relative to the full-sensor baseline. Under complete LiDAR failure, retention drops to 79.7% NDS and 67.3% mAP. Although performance degrades significantly, the system remains functional, preserving essential safety margins required for fault-tolerant autonomous driving. The asymmetric degradation arises from complementary sensing roles: cameras provide rich semantic cues aiding coarse localization and recognition, whereas LiDAR contributes high-fidelity geometry necessary for precise 3D spatial reasoning.
Figure 4: Performance across detection range on nuScenes validation set. Comparison of BEVFusion's LiDAR-only backbone (blue), BEVFusion full fusion (red), and MambaFusion (ours). MambaFusion achieves consistent improvements with gains increasing at greater distances: +1.5 points at 0–20 m, +3.9 points at 20–30 m, and +5.5 points beyond 30 m relative to BEVFusion. This pattern reflects complementary degradation profiles of camera and LiDAR sensors, with adaptive fusion providing the greatest benefits where sensor quality degrades most.
Figure 5: Per-component latency breakdown on NVIDIA A100. Left: Percentage distribution showing that the hybrid LiDAR encoder (28.6%) and fusion decoder with MTA (26.5%) together constitute 55% of total latency. Right: Absolute latencies reveal that the proposed refinement modules, GNN reasoning (6.2 ms, red border) and geometry-conditioned diffusion (5.8 ms, red border), contribute only 16.1% combined overhead. Despite adding these refinement modules, the total runtime of 74.6 ms (13.4 FPS) remains competitive due to linear-time Mamba blocks replacing quadratic temporal attention for BEV feature aggregation.
...and 2 more figures
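
The latency numbers quoted in the Figure 5 caption are mutually consistent, which a few lines of arithmetic can confirm. The constants below are copied from the caption; the helper names are hypothetical and exist only for this check.

```python
# Values quoted in the Figure 5 caption.
TOTAL_LATENCY_MS = 74.6
GNN_MS = 6.2
DIFFUSION_MS = 5.8
LIDAR_ENCODER_SHARE = 28.6   # percent of total latency
FUSION_DECODER_SHARE = 26.5  # percent of total latency


def frames_per_second(latency_ms):
    """Convert a per-frame latency in milliseconds to throughput in FPS."""
    return 1000.0 / latency_ms


def share_of_total(part_ms, total_ms):
    """Express a component latency as a percentage of the total."""
    return 100.0 * part_ms / total_ms


refinement_share = share_of_total(GNN_MS + DIFFUSION_MS, TOTAL_LATENCY_MS)
encoder_plus_fusion = LIDAR_ENCODER_SHARE + FUSION_DECODER_SHARE

print(round(frames_per_second(TOTAL_LATENCY_MS), 1))  # 13.4
print(round(refinement_share, 1))                     # 16.1
print(round(encoder_plus_fusion, 1))                  # 55.1
```

The two refinement modules add 12.0 ms out of 74.6 ms, matching the stated 16.1%, and the encoder and fusion decoder shares sum to 55.1%, which the caption rounds to 55%.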
