DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

Jiasen Wang, Zhenglin Li, Ke Sun, Xianyuan Liu, Yang Zhou

TL;DR

DVPE addresses feature interference and training difficulty in sparse query-based multi-view 3D object detection. It divides the global 3D space into local virtual spaces and applies visibility cross-attention within each space, which decouples the position embedding from camera poses. It also adds 2D RoI features to temporal modeling and trains with a one-to-many assignment strategy that provides richer supervision. DVPE reports state-of-the-art results on the nuScenes test set (57.2% mAP and 64.5% NDS), with ablations supporting the gains from divided views, temporal RoI fusion, and extra query supervision. The design may also apply to other sparse query-based multi-view detectors used in vision-only autonomous driving perception.

Abstract

Sparse query-based paradigms have achieved significant success in multi-view 3D detection for autonomous vehicles. Current research faces challenges in balancing enlarged receptive fields against reduced interference when aggregating multi-view features. Moreover, the different poses of cameras make global attention models difficult to train. To address these problems, this paper proposes a divided view method, in which features are modeled globally via the visibility cross-attention mechanism, but interact only with partial features in a divided local virtual space. This effectively reduces interference from irrelevant features and alleviates the training difficulties of the transformer by decoupling the position embedding from camera poses. Additionally, 2D historical RoI features are incorporated into the object-centric temporal modeling to utilize high-level visual semantic information. The model is trained using a one-to-many assignment strategy to facilitate stability. Our framework, named DVPE, achieves state-of-the-art performance (57.2% mAP and 64.5% NDS) on the nuScenes test set. Code will be available at https://github.com/dop0/DVPE.

Paper Structure (28 sections, 10 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Illustration of the image feature aggregation process of sparse sampling, global attention, and our proposed divided view approach, all of which belong to sparse query-based paradigms. The green dot represents the 3D reference point of a query. Solid lines indicate projection onto the image plane, while triangles represent projected sampling points on images. The portion enclosed by a red frame represents the image features that interact with queries through cross-attention. The blue frustums represent the regions determined by camera rays, whose 3D coordinates are encoded into position embeddings for cross-attention operations.
  • Figure 2: Overall architecture of DVPE. The framework is based on the transformer decoder, where the initial queries update iteratively through temporal attention and visibility cross-attention. In temporal attention, object queries interact with themselves as well as historical decoder embedding and 2D RoI embedding stored in the memory queue. Before visibility cross-attention, object queries and image features are grouped based on their 3D coordinates and then transformed into several local virtual spaces to obtain divided view position embedding. Isolated cross-attention is performed between queries and image features within different spaces. Subsequently, predictions are made in local virtual spaces and then transformed back to the 3D world coordinate system as final detection results, following which the memory queue is updated. Additional 3D reference points are used in conjunction with default 3D reference points for one-to-many assignment during training.
  • Figure 3: Illustration of space division (left) and of the transformation between a divided space and its local virtual space in BEV. For illustration, the global space is partitioned into only four spaces, and the $v$-th space is shown. The green dot denotes a 3D reference point, while the green arrows in the middle and on the right indicate the predicted yaw in the local virtual space and in the world coordinate system, respectively.
  • Figure 4: Visualization of image regions within one of the divided spaces and the corresponding attention maps of a query. Attention maps are from two heads of the last decoder layer.
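
The space division and coordinate transformation described in the Figure 3 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the BEV plane is split into equal azimuth sectors around the ego vehicle, and that each local virtual space is obtained by rotating a sector about the vertical axis so that its center direction lines up with the +x axis. The function names (`assign_space`, `to_local`, `to_world`) and the choice of four sectors are hypothetical.

```python
import numpy as np


def sector_center(v, num_spaces=4):
    """Azimuth (radians) of the center of the v-th sector (assumed equal sectors)."""
    width = 2 * np.pi / num_spaces
    return (v + 0.5) * width


def assign_space(xy, num_spaces=4):
    """Return, for each BEV point (x, y), the index of the sector it falls in."""
    azimuth = np.arctan2(xy[:, 1], xy[:, 0]) % (2 * np.pi)  # map to [0, 2*pi)
    width = 2 * np.pi / num_spaces
    # The final modulo guards against rounding that lands exactly on 2*pi.
    return np.floor(azimuth / width).astype(int) % num_spaces


def rotate(xy, angle):
    """Rotate BEV points counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return xy @ rot.T


def to_local(xy, v, num_spaces=4):
    """World BEV coordinates -> local virtual space of sector v (assumed rotation)."""
    return rotate(xy, -sector_center(v, num_spaces))


def to_world(xy_local, yaw_local, v, num_spaces=4):
    """Local predictions (position, yaw) -> world coordinate system."""
    theta = sector_center(v, num_spaces)
    xy = rotate(xy_local, theta)
    # Wrap the yaw back into [-pi, pi) after adding the sector rotation.
    yaw = (yaw_local + theta + np.pi) % (2 * np.pi) - np.pi
    return xy, yaw


# One reference point per quadrant, each lying on its sector's center direction.
points = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
spaces = assign_space(points)
local = np.stack([to_local(p[None], v)[0] for p, v in zip(points, spaces)])
```

Under these assumptions, every point in `points` lands at the same local coordinate regardless of which sector it came from. This illustrates the sense in which a position embedding computed in the local virtual space no longer depends on where the sector sits in the world frame. Predictions made locally, including yaw, are mapped back with `to_world`.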