Multimodal Object Query Initialization for 3D Object Detection

Mathijs R. van Geerenstein, Felicia Ruppel, Klaus Dietmayer, Dariu M. Gavrila

TL;DR

EfficientQ3M is an efficient, modular, and multimodal solution for object query initialization in transformer-based 3D object detection models. It outperforms the state of the art in transformer-based LiDAR object detection on the competitive nuScenes benchmark and showcases the benefits of input-dependent multimodal query initialization.

Abstract

3D object detection models that exploit both LiDAR and camera sensor features are top performers in large-scale autonomous driving benchmarks. A transformer is a popular network architecture used for this task, in which so-called object queries act as candidate objects. Initializing these object queries based on current sensor inputs is a common practice. However, existing methods for this rely strongly on LiDAR data and do not fully exploit image features. They also introduce significant latency. To overcome these limitations, we propose EfficientQ3M, an efficient, modular, and multimodal solution for object query initialization for transformer-based 3D object detection models. The proposed initialization method is combined with a "modality-balanced" transformer decoder where the queries can access all sensor modalities throughout the decoder. In experiments, we outperform the state of the art in transformer-based LiDAR object detection on the competitive nuScenes benchmark and showcase the benefits of input-dependent multimodal query initialization, while being more efficient than the available alternatives for LiDAR-camera initialization. The proposed method can be applied with any combination of sensor modalities as input, demonstrating its modularity.

Paper Structure (20 sections, 2 equations, 5 figures, 4 tables)


Figures (5)

  • Figure 1: Different query initialization approaches in transformer-based LiDAR-camera object detection. We show the initial object query locations from the bird's eye view with (a) input-agnostic initialization as in FUTR3D [chen_futr3d_2023] and (b) the proposed feature-informed initialization.
  • Figure 2: Different approaches to sensor fusion within a transformer decoder. We refer to (a) as sequential fusion, found in TransFusion [bai_transfusion_2022], and to (b) as modality-balanced fusion, used in our proposed method. L denotes LiDAR and C denotes camera.
  • Figure 3: Overview of EfficientQ3M, with the main contribution framed in red. We start with a fixed grid $\mathcal{C}$ of $M_{dense}$ query location proposals. We sample LiDAR and camera features at instance level for each proposal ①, and predict a bounding box relative to the grid location. The 3D $xyz$ centers of the top-$M$ bounding boxes with the highest confidence scores are selected as the set of initial object query locations. We re-sample LiDAR and camera features for these $M$ object queries ② and pass them to the modality-balanced decoder, where the queries have access to both sensor modalities in each layer of the decoder ③. A regression and classification head is used to produce the final detections from the object queries at the output of the decoder.
  • Figure 4: Detection performance vs. the number of decoder layers on the nuScenes val set in the LiDAR-only and LiDAR-camera setting, compared to FUTR3D [chen_futr3d_2023]. The proposed method needs fewer layers and fewer queries to outperform FUTR3D.
  • Figure 5: Example of output predictions with the proposed model on the nuScenes val set. The LiDAR BEV shows ground truth objects in dark green.
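
The query selection step described in the Figure 3 caption (a fixed grid of $M_{dense}$ proposals, one box prediction per proposal relative to its grid location, and the top-$M$ box centers kept as initial query locations) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `make_grid` and `select_initial_queries`, the grid layout, and the offset format `(dx, dy, z)` are assumptions made for illustration, and the feature sampling and box prediction network are replaced by precomputed offsets and scores.

```python
def make_grid(x_min, x_max, y_min, y_max, n_x, n_y):
    """Return n_x * n_y evenly spaced (x, y) cell centers in the BEV plane.

    Hypothetical helper: the paper only states that the grid is fixed.
    """
    dx = (x_max - x_min) / n_x
    dy = (y_max - y_min) / n_y
    return [
        (x_min + (i + 0.5) * dx, y_min + (j + 0.5) * dy)
        for i in range(n_x)
        for j in range(n_y)
    ]


def select_initial_queries(grid, offsets, scores, m):
    """Keep the m most confident proposals and return their 3D centers.

    grid:    list of (x, y) proposal locations, length M_dense
    offsets: list of (dx, dy, z) predictions, where dx and dy are relative
             to the grid location and z is absolute (assumed format)
    scores:  list of confidence scores, one per proposal
    """
    if not (len(grid) == len(offsets) == len(scores)):
        raise ValueError("grid, offsets and scores must have the same length")
    # Indices of proposals sorted by descending confidence, truncated to m.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:m]
    return [
        (grid[i][0] + offsets[i][0], grid[i][1] + offsets[i][1], offsets[i][2])
        for i in order
    ]


# Toy usage with M_dense = 4 proposals and M = 2 selected queries.
grid = make_grid(-2.0, 2.0, -2.0, 2.0, 2, 2)
offsets = [(0.5, 0.0, 0.1), (0.0, 0.0, 0.0), (0.0, -0.5, 0.2), (0.0, 0.0, 0.0)]
scores = [0.9, 0.1, 0.8, 0.3]
queries = select_initial_queries(grid, offsets, scores, m=2)
```

In the described pipeline, LiDAR and camera features would then be re-sampled at the returned locations before the queries enter the decoder. That step is omitted here because it depends on the feature extractors.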