SimPB: A Single Model for 2D and 3D Object Detection from Multiple Cameras

Yingqi Tang, Zhaotie Meng, Guoliang Chen, Erkang Cheng

TL;DR

SimPB is a unified, single-model detector that jointly predicts 2D boxes in the perspective view and 3D boxes in BEV space from multiple cameras. Its core is a hybrid decoder that combines multi-view 2D decoder layers with 3D decoder layers; a Dynamic Query Allocation module and an Adaptive Query Aggregation module pass information between the two in a cyclic 3D-2D-3D manner. Query-group Attention further strengthens the interaction among 2D queries within each camera group. Evaluated on nuScenes, SimPB reports promising results for both 2D and 3D detection, demonstrates transferability to other detectors, and includes ablations on the cyclic integration and dynamic query allocation.

Abstract

The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird's Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once for token selection or query initialization. In this paper, we present a single model termed SimPB, which simultaneously detects 2D objects in the perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results, in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: https://github.com/nullmax-vision/SimPB.

Paper Structure

This paper contains 30 sections, 9 equations, 27 figures, 10 tables.

Figures (27)

  • Figure 1: Comparisons of different multi-view object detection pipelines. (a) Multi-view 3D object detection. (b) A two-stage multi-view 3D object detector where 2D box detection is used as token selection or 3D query initialization. (c) Our proposed unified paradigm simultaneously predicts 2D and 3D results in a single model.
  • Figure 2: Overview of SimPB, a unified multi-view 2D and 3D object detection framework. Multi-view features are extracted by an image backbone and then enhanced by an encoder module. A hybrid decoder module which consists of multi-view 2D decoder layers and 3D decoder layers is used to compute 2D and 3D detection results.
  • Figure 3: Illustration of the Query-Group Attention. We enforce interaction among 2D queries only within the same camera group. DA represents deformable attention.
  • Figure 4: Illustration of the Adaptive Query Aggregation. The indicator vector represents whether a 2D query is truncated or not.
  • Figure S1: Query initialization
  • ...and 22 more figures
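
The Figure 3 caption states that interaction among 2D queries is enforced only within the same camera group. A minimal sketch of that restriction follows. It is not the authors' implementation: it substitutes plain scaled dot-product self-attention for whatever attention the paper uses, and the names `group_mask`, `masked_self_attention`, and `camera_ids` are hypothetical, introduced here only for illustration.

```python
import math


def group_mask(camera_ids):
    """Return mask[i][j] == True when queries i and j share a camera.

    camera_ids is a hypothetical per-query camera index; the source does
    not say how SimPB tracks which camera a 2D query belongs to.
    """
    return [[ci == cj for cj in camera_ids] for ci in camera_ids]


def masked_self_attention(queries, camera_ids):
    """Scaled dot-product self-attention limited to same-camera pairs.

    queries: list of equal-length float vectors, one per 2D query.
    Assumption: queries act as their own keys and values (no learned
    projections), which keeps the sketch short.
    """
    dim = len(queries[0])
    mask = group_mask(camera_ids)
    outputs = []
    for i, q in enumerate(queries):
        # Score only the keys that sit in the same camera group as query i.
        scores = {
            j: sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for j, k in enumerate(queries)
            if mask[i][j]
        }
        # Softmax over the allowed keys; keys from other cameras get no weight.
        top = max(scores.values())
        exp = {j: math.exp(s - top) for j, s in scores.items()}
        total = sum(exp.values())
        outputs.append([
            sum(exp[j] / total * queries[j][d] for j in exp)
            for d in range(dim)
        ])
    return outputs


# Two queries from camera 0 and one from camera 1.
queries = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
camera_ids = [0, 0, 1]
out = masked_self_attention(queries, camera_ids)
# The third query is alone in its group, so it attends only to itself.
```

In this toy input the first two queries mix with each other but never see the third, which is the block-diagonal interaction pattern the caption describes. A practical version would more likely build the same boolean mask once and pass it to a batched attention routine.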