MambaDETR: Query-based Temporal Modeling using State Space Model for Multi-View 3D Object Detection

Tong Ning, Ke Lu, Xirui Jiang, Jian Xue

TL;DR

This paper proposes MambaDETR, a method that performs temporal fusion in an efficient state space, and designs a Motion Elimination module that removes relatively static objects before temporal fusion.

Abstract

Utilizing temporal information to improve the performance of 3D detection has made great progress recently in the field of autonomous driving. Traditional transformer-based temporal fusion methods suffer from quadratic computational cost and information decay as the length of the frame sequence increases. In this paper, we propose a novel method called MambaDETR, whose main idea is to implement temporal fusion in the efficient state space. Moreover, we design a Motion Elimination module to remove relatively static objects from temporal fusion. On the standard nuScenes benchmark, our proposed MambaDETR achieves remarkable results in the 3D object detection task, exhibiting state-of-the-art performance among existing temporal fusion methods.
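
To illustrate why a state-space formulation avoids the quadratic cost mentioned above, here is a minimal sketch of a plain diagonal linear state-space recurrence over a frame sequence. It is not the paper's Query Mamba module and not Mamba's selective scan; the function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical and chosen only for illustration. Each step updates a fixed-size hidden state, so the cost grows linearly with the number of frames rather than with every pair of frames.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear state-space recurrence over a sequence.

    x: (T, D) array, one D-dimensional feature per frame.
    a, b, c: (D,) arrays (hypothetical fixed parameters, not learned here).
    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    """
    x = np.asarray(x, dtype=float)
    h = np.zeros(x.shape[1])      # hidden state, fixed size regardless of T
    y = np.empty_like(x)
    for t in range(x.shape[0]):   # one O(D) update per frame -> O(T * D) total
        h = a * h + b * x[t]
        y[t] = c * h
    return y

# Three frames of a single constant feature.
out = ssm_scan(np.ones((3, 1)), np.array([0.5]), np.array([1.0]), np.array([2.0]))
```

With `a < 1` the contribution of older frames decays geometrically; Mamba-style models make these parameters input-dependent, which this sketch does not attempt.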

Paper Structure

This paper contains 21 sections, 13 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Different temporal fusion methods, in a recurrent manner and in a sequential manner.
  • Figure 2: Overall architecture of the proposed MambaDETR. Key enhancements include 2D-priors-based query initialization, Motion Elimination to retain only moving 3D queries across frames, and Query Mamba for state-space temporal fusion. The 3D queries interact with the current image frame in a transformer decoder, producing the final 3D object detections.
  • Figure 3: Detailed structure of the Query Generator and Motion Elimination modules in the MambaDETR architecture. (a) The Query Generator uses a 2D detector and DepthNet to create 2D proposals, which are transformed into 3D queries by integrating position and semantic embeddings. (b) The Motion Elimination module then filters out static 3D queries across frames by measuring the distance between their center points.
  • Figure 4: Visualization of MambaDETR.
  • Figure 5: Performance (mAP), inference speed (samples/second), and memory requirements (GB) of MambaDETR as a function of the sequence length.
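
The Figure 3 caption says that Motion Elimination filters out static 3D queries by measuring the distance between center points. The sketch below shows one way such a filter could look. It is an assumption-laden illustration, not the authors' implementation: the function name `motion_elimination`, the nearest-neighbor matching, the threshold value, and the premise that both sets of centers are already expressed in a common coordinate frame are all hypothetical.

```python
import numpy as np

def motion_elimination(prev_centers, curr_centers, static_thresh=0.5):
    """Return a boolean mask over prev_centers: True = keep (treated as moving).

    prev_centers: (N, 3) query centers from an earlier frame.
    curr_centers: (M, 3) query centers from the current frame.
    static_thresh: hypothetical distance (same units as the centers) below
        which a query is treated as static and dropped.
    """
    prev_centers = np.asarray(prev_centers, dtype=float)
    curr_centers = np.asarray(curr_centers, dtype=float)
    if curr_centers.shape[0] == 0:
        # Nothing to compare against, so keep every earlier query.
        return np.ones(prev_centers.shape[0], dtype=bool)
    # Pairwise Euclidean distances, shape (N, M).
    diff = prev_centers[:, None, :] - curr_centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # A query whose nearest current center is very close has barely moved.
    nearest = dist.min(axis=1)
    return nearest > static_thresh

prev = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
curr = np.array([[0.1, 0.0, 0.0], [12.0, 0.0, 0.0]])
mask = motion_elimination(prev, curr)  # first query is static, second moved
```

Dropping near-static queries shortens the sequence that the temporal fusion stage has to process, which is the motivation the abstract gives for the module.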