DriveMamba: Task-Centric Scalable State Space Model for Efficient End-to-End Autonomous Driving

Haisheng Su, Wei Wu, Feixiang Song, Junjie Zhang, Zhenjie Yang, Junchi Yan

TL;DR

DriveMamba introduces a task-centric, scalable end-to-end autonomous driving framework that unifies view correspondence learning, dynamic task relation modeling, and long-term temporal fusion inside a single sparse-token Mamba decoder. It tokenizes image features and task queries, orders them with a trajectory-guided hybrid spatiotemporal scan, and maintains a memory of history queries. This gives linear-complexity sequence modeling (in place of quadratic-complexity attention) and planning-oriented perception without constructing dense BEV features. Experiments on nuScenes and Bench2Drive report stronger planning and perception performance at lower cost, and indicate that the approach scales to larger models and longer temporal horizons. The work advances practical E2E-AD by enabling joint optimization of perception, prediction, and planning within a unified, parallel decoding framework.

Abstract

Recent advances in End-to-End Autonomous Driving (E2E-AD) have often been devoted to integrating modular designs into a unified framework for joint optimization, e.g. UniAD. Such methods follow a sequential paradigm (i.e., perception-prediction-planning) based on separable Transformer decoders and rely on dense BEV features to encode scene representations. However, such a manually ordered design can cause information loss and cumulative errors, and it lacks flexible and diverse relation modeling among different modules and sensors. Meanwhile, insufficient training of the image backbone and the quadratic complexity of the attention mechanism also hinder the scalability and efficiency of E2E-AD systems when handling spatiotemporal input. To this end, we propose DriveMamba, a Task-Centric Scalable paradigm for efficient E2E-AD, which integrates dynamic task relation modeling, implicit view correspondence learning and long-term temporal fusion into a single-stage Unified Mamba decoder. Specifically, both extracted image features and expected task outputs are converted into token-level sparse representations in advance, which are then sorted by their instantiated positions in 3D space. The linear-complexity operator enables efficient long-context sequential token modeling to capture task-related inter-dependencies simultaneously. Additionally, a bidirectional trajectory-guided "local-to-global" scan method is designed to preserve spatial locality from the ego perspective, thus facilitating ego-planning. Extensive experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superiority, generalizability and efficiency of DriveMamba.
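
To make the "sorted by position" and "local-to-global" ideas in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes each token carries a 3D position and that "local-to-global" can be approximated by sorting tokens by their distance to the nearest waypoint of a candidate ego trajectory. The function names `local_to_global_order` and `bidirectional_sequences`, and the toy data, are hypothetical.

```python
import numpy as np

def local_to_global_order(token_xyz, ego_trajectory):
    """Return indices that sort tokens from nearest to farthest from the ego trajectory.

    token_xyz:      (N, 3) token positions in the ego frame (assumed layout).
    ego_trajectory: (W, 3) waypoints of a candidate ego trajectory (assumed input).
    """
    # Pairwise token-to-waypoint distances, shape (N, W).
    diffs = token_xyz[:, None, :] - ego_trajectory[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Distance of each token to its closest waypoint.
    nearest = dists.min(axis=1)
    # A stable sort keeps the original order among equally distant tokens.
    return np.argsort(nearest, kind="stable")

def bidirectional_sequences(token_feats, order):
    """Build the forward (local-to-global) and backward (global-to-local) sequences."""
    forward = token_feats[order]
    backward = forward[::-1]
    return forward, backward

# Toy example: four tokens and a straight three-waypoint trajectory.
token_xyz = np.array([[10.0, 0.0, 0.0],
                      [1.0, 0.5, 0.0],
                      [4.0, -3.0, 0.0],
                      [0.2, 0.0, 0.0]])
ego_trajectory = np.array([[0.0, 0.0, 0.0],
                           [2.0, 0.0, 0.0],
                           [4.0, 0.0, 0.0]])
token_feats = np.arange(8, dtype=float).reshape(4, 2)

order = local_to_global_order(token_xyz, ego_trajectory)
forward, backward = bidirectional_sequences(token_feats, order)
```

Here the token closest to the trajectory comes first in the forward sequence and last in the backward one, so a sequential model sees ego-relevant context first in one direction and accumulates global context before reaching it in the other.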

Paper Structure (26 sections, 12 equations, 6 figures, 16 tables)

Figures (6)

  • Figure 1: Comparison of different end-to-end autonomous driving paradigms. (a) and (b) follow the sequential Transformer paradigm based on dense BEV features (Hu et al., 2023) and a sparse query set (Sun et al., 2024), respectively. (c) explores multi-task BEV learning with parallel Transformer decoders (Weng et al., 2024). (d) Our proposed Task-Centric paradigm learns task relations dynamically through a unified Mamba decoder, which directly leverages raw sensor inputs and a history token memory for long-term spatiotemporal modeling without constructing expensive BEV features, making it scalable and efficient.
  • Figure 2: Framework of DriveMamba. The multi-view images are encoded into a token-level feature sequence, and the spatiotemporal queries for the different tasks are initialized separately. We then adopt the unified Mamba decoder with bidirectional serialization for simultaneous view correspondence learning, task relation modeling and long-term temporal fusion.
  • Figure 3: Different bidirectional scan methods. Both spatial and temporal scan types are illustrated.
  • Figure 4: Detailed structure of the Unified Mamba Decoder. (a) The B-Mamba layer is the basic component used for decoding. (b) The divided modeling type is illustrated here for clarity.
  • Figure 5: Comparisons between Transformer-based and our Mamba-based E2E-AD methods.
  • ...and 1 more figure
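
The bidirectional scan and linear-complexity decoding referred to in the figure captions can be illustrated with a toy recurrence. This is a sketch under strong simplifying assumptions: it uses a fixed diagonal linear state-space recurrence, whereas Mamba uses input-dependent (selective) parameters and a hardware-aware parallel scan, and the way DriveMamba's B-Mamba layer combines the two directions is not specified here. Summing the two directions, and the names `linear_scan` and `bidirectional_scan`, are assumptions made for illustration.

```python
import numpy as np

def linear_scan(x, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c * h_t over a sequence.

    x: (T, D) token sequence; a, b, c: (D,) per-channel parameters.
    Cost grows linearly with the sequence length T.
    """
    h = np.zeros(x.shape[1])
    y = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]   # update the hidden state with the current token
        y[t] = c * h           # read out the state
    return y

def bidirectional_scan(x, a, b, c):
    """Scan the sequence in both directions and sum the outputs (assumed fusion)."""
    fwd = linear_scan(x, a, b, c)
    # Reverse the sequence, scan, then restore the original token order.
    bwd = linear_scan(x[::-1], a, b, c)[::-1]
    return fwd + bwd

# Toy example: three tokens with one channel.
x = np.ones((3, 1))
a = np.array([0.5])
b = np.array([1.0])
c = np.array([1.0])
out = bidirectional_scan(x, a, b, c)
```

Each token's output depends on everything before it in the forward pass and everything after it in the backward pass, which is how a causal recurrence can still give every token full-sequence context in a single layer.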