CoMamba: Real-time Cooperative Perception Unlocked with State Space Models

Jinlong Li, Xinyu Liu, Baolu Li, Runsheng Xu, Jiachen Li, Hongkai Yu, Zhengzhong Tu

TL;DR

The paper tackles real-time cooperative perception for V2X systems by addressing the scalability and latency limitations of transformer-based fusion. It introduces CoMamba, a linear-complexity framework based on state-space models (Mamba) with two novel modules, CSS2D and GPM, to enable efficient 3D feature fusion across multiple connected agents. Empirical results on OPV2V, V2XSet, and V2V4Real show state-of-the-art detection accuracy with real-time performance (26.9 FPS) and linear scaling with increasing agents. This work demonstrates the practicality of SSM-based backbones for large-scale, onboard cooperative perception in intelligent transportation networks.

Abstract

Cooperative perception systems play a vital role in enhancing the safety and efficiency of vehicular autonomy. Although recent studies have highlighted the efficacy of vehicle-to-everything (V2X) communication techniques in autonomous driving, a significant challenge persists: how to efficiently integrate multiple high-bandwidth features across an expanding network of connected agents such as vehicles and infrastructure. In this paper, we introduce CoMamba, a novel cooperative 3D detection framework designed to leverage state-space models for real-time onboard vehicle perception. Compared to prior state-of-the-art transformer-based models, CoMamba is a more scalable 3D model: it uses bidirectional state-space models and thereby avoids the quadratic complexity of attention mechanisms. In extensive experiments on V2X/V2V datasets, CoMamba achieves superior performance compared to existing methods while maintaining real-time processing capabilities. The proposed framework not only enhances object detection accuracy but also significantly reduces processing time, making it a promising solution for next-generation cooperative perception systems in intelligent transportation networks.
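The scaling argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It assumes a full self-attention over the flattened sequence of all agents' feature tokens, which is a simplification (the transformer baselines may use windowed or sparse attention), and it counts only token interactions, not actual FLOPs. The function names and the feature-map size of 48 x 176 are hypothetical and chosen for illustration only.

```python
def sequence_length(num_agents: int, height: int, width: int) -> int:
    """Number of tokens when K agents each contribute an H x W feature map."""
    return num_agents * height * width


def attention_pair_count(seq_len: int) -> int:
    """Full self-attention scores every token against every token: O(L^2)."""
    return seq_len * seq_len


def scan_step_count(seq_len: int, num_paths: int = 4) -> int:
    """A recurrent scan visits each token once per scanning path: O(L)."""
    return num_paths * seq_len


if __name__ == "__main__":
    height, width = 48, 176  # hypothetical feature-map size
    for num_agents in (1, 2, 4, 8):
        length = sequence_length(num_agents, height, width)
        print(num_agents, attention_pair_count(length), scan_step_count(length))
```

Under these assumptions, doubling the number of agents quadruples the attention pair count but only doubles the scan step count.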

Paper Structure (10 sections, 7 equations, 5 figures, 3 tables)

This paper contains 10 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of our CoMamba V2X-based perception framework. (a) The CoMamba V2X perception system involves V2X metadata sharing, a LiDAR visual encoder, feature sharing, and the CoMamba fusion network, which produces the final prediction. (b) Our CoMamba fusion network leverages the Cooperative 2D-Selective-Scan Module to effectively fuse the complex interactions present in high-resource-cost V2X data sequences. The Global-wise Pooling Module efficiently attains global information among the overlapping features of the CAVs.
  • Figure 2: Illustration of the Cooperative 2D-Selective-Scan (CSS2D) process. The features of $K$ CAVs, represented as $\mathbb{R}^{H \times W}$, are embedded into patches. These patches are then traversed along four different scanning paths, with each 1D sequence (of length $KHW$) independently processed by distinct Mamba blocks [gu2023mamba] in parallel. Afterward, the resulting outputs are reshaped and merged to form the 3D feature maps, which maintain the same dimensions as the input features. In this instance, we use $K = 3$ as an illustrative example.
  • Figure 3: Scalability comparisons: CoMamba, V2X-ViT [xu2022v2x], and CoBEVT [xu2022cobevt] in GFLOPs, latency, and GPU memory footprint with respect to the number of agents (total feature dimensions). The dotted lines indicate the estimated values when tested GPUs are out-of-memory.
  • Figure 4: Visualizations of 3D object detection results. Green and red 3D bounding boxes represent the ground truth and prediction, respectively. Some false detection examples are highlighted with red arrows.
  • Figure 5: Visualization of Fused Intermediate Features on the OPV2V Default Testing Set. We compare the fused intermediate features from our method, CoMamba, against those of other SOTA methods, V2X-ViT and CoBEVT, using two samples. The first row shows the two point cloud samples from CoMamba, corresponding to the fused intermediate features in the following rows. Our fused intermediate features are clearer, with more accurate local features corresponding to objects. Additionally, the shape of the scenario modeling is more complete compared to the other methods.
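
The CSS2D process described in the Figure 2 caption (flatten the features of $K$ agents into 1D sequences of length $KHW$ along four scanning paths, process each sequence, then reshape and merge) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. Several details are assumptions: the four paths are taken to be row-major, column-major, and their reverses; a channel axis `C` is added; the outputs are merged by summation; and `placeholder_sequence_model` is a hypothetical causal running mean standing in for a real Mamba block.

```python
import numpy as np


def four_scan_paths(features):
    """Flatten (K, H, W, C) features into four (K*H*W, C) sequences.

    Assumed paths: row-major, column-major, and the reverse of each.
    """
    K, H, W, C = features.shape
    row_major = features.reshape(K * H * W, C)
    col_major = features.transpose(0, 2, 1, 3).reshape(K * H * W, C)
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def placeholder_sequence_model(seq):
    """Hypothetical stand-in for a Mamba block: a causal running mean."""
    steps = np.arange(1, seq.shape[0] + 1)[:, None]
    return np.cumsum(seq, axis=0) / steps


def merge_paths(outputs, shape):
    """Undo each scan ordering and sum the four results (assumed merge rule)."""
    K, H, W, C = shape
    fwd_row = outputs[0].reshape(K, H, W, C)
    fwd_col = outputs[1].reshape(K, W, H, C).transpose(0, 2, 1, 3)
    rev_row = outputs[2][::-1].reshape(K, H, W, C)
    rev_col = outputs[3][::-1].reshape(K, W, H, C).transpose(0, 2, 1, 3)
    return fwd_row + fwd_col + rev_row + rev_col


def css2d_sketch(features, seq_model=placeholder_sequence_model):
    """Scan, process each path independently, and merge to the input shape."""
    outputs = [seq_model(path) for path in four_scan_paths(features)]
    return merge_paths(outputs, features.shape)


if __name__ == "__main__":
    K, H, W, C = 3, 4, 6, 8  # K = 3 agents, as in the Figure 2 example
    fused = css2d_sketch(np.random.rand(K, H, W, C))
    print(fused.shape)  # same shape as the input features
```

Because each path is a single pass over $KHW$ tokens, the cost of this sketch grows linearly with the number of agents, which matches the linear-scaling behavior the paper reports for CoMamba.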