Multi-View Attentive Contextualization for Multi-View 3D Object Detection

Xianpeng Liu, Ce Zheng, Ming Qian, Nan Xue, Chen Chen, Zhebin Zhang, Chen Li, Tianfu Wu

TL;DR

The paper addresses 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detectors, where existing approaches struggle to exploit high-resolution 2D context at an acceptable computational cost. It introduces Multi-View Attentive Contextualization (MvACon), a module based on Patch-to-Cluster attention that enriches 2D features with global scene context before 2D-to-3D lifting and is designed to plug into both decoder-only and encoder-decoder MV3D architectures. By clustering at multiple scales of the feature pyramid, MvACon grounds 3D queries more densely in 2D features, which improves location, orientation, and velocity estimation on nuScenes with the PETR, BEVFormer, and DFA3D baselines, and on Waymo-mini with BEVFormer. The results support the view that contextualized features matter for 3D scene understanding and offer a practical path to stronger camera-based 3D detectors.

Abstract

We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress in query-based MV3D object detection, prior art often suffers either from failing to exploit high-resolution 2D features in dense attention-based lifting, due to high computational costs, or from insufficiently dense grounding of 3D queries to multi-scale 2D features in sparse attention-based lifting. Our proposed MvACon addresses both problems at once using a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to the specific 2D-to-3D feature lifting approach. In experiments, MvACon is thoroughly tested on the nuScenes benchmark using BEVFormer, its recent 3D deformable attention (DFA3D) variant, and PETR, showing consistent improvement in detection performance, especially in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer, with similar improvement. We qualitatively and quantitatively show that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of our proposed MvACon reinforce the adage in computer vision: "(contextualized) feature matters".
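
The phrase "representationally dense yet computationally sparse" can be illustrated with a toy patch-to-cluster attention step. The sketch below is not the authors' implementation: the function name, the projection matrices (`w_cluster`, `w_q`, `w_k`, `w_v`), the softmax axes, and the toy sizes are all assumptions made for illustration. It only shows the general idea that N patch features attend to M learned cluster summaries (M much smaller than N), so every patch sees global context at O(N·M) rather than O(N²) cost.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists (rows of a by columns of b)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def patch_to_cluster_attention(patches, w_cluster, w_q, w_k, w_v):
    """Toy single-head patch-to-cluster attention (all weights are hypothetical).

    patches:   N x C patch features
    w_cluster: C x M projection producing cluster-assignment logits
    w_q/w_k/w_v: C x D projections for queries, keys, and values
    """
    # Soft assignment: for each cluster, a softmax over all N patches (assumed axis).
    logits = matmul(patches, w_cluster)                        # N x M
    assign_t = [softmax(col) for col in transpose(logits)]     # M x N
    # Each cluster is a weighted average of patch features.
    clusters = matmul(assign_t, patches)                       # M x C
    # Queries come from patches; keys and values come from the M clusters.
    q = matmul(patches, w_q)                                   # N x D
    k = matmul(clusters, w_k)                                  # M x D
    v = matmul(clusters, w_v)                                  # M x D
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))                           # N x M, not N x N
    attn = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attn, v), attn                               # N x D, N x M


random.seed(0)
N, C, M, D = 12, 4, 3, 4  # toy sizes chosen for illustration only


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


patches = rand_matrix(N, C)
out, attn = patch_to_cluster_attention(
    patches, rand_matrix(C, M), rand_matrix(C, D), rand_matrix(C, D), rand_matrix(C, D)
)
print(len(out), len(out[0]), len(attn[0]))
```

In this sketch every output row mixes information from all clusters, and each cluster in turn summarizes all patches, which is one way to read "dense" context at "sparse" cost. Residual connections, multi-head attention, and multi-scale handling are omitted.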

Paper Structure (14 sections, 5 equations, 12 figures, 7 tables)

This paper contains 14 sections, 5 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: The effect of our proposed MvACon on 2D-to-3D feature lifting. Consider a point (red) on the car in (a), which is projected from a 3D BEV anchor point. To lift 2D features that ground the 3D BEV anchor point, vanilla BEVFormer uses a predefined number of deformable points whose offsets, relative to the projection point, are learned through a 6-layer cross-attention module. (b) shows the deformed points after the final cross-attention layer; most of them have low attention weights, indicating the model's uncertainty or inability to consolidate contributions effectively for good lifting. Our MvACon tackles this issue with clustering-based attention, as visualized in (c). (d) shows the resulting deformed points, where we observe high-confidence points not only on the car but also on the building. We further observe that the points on the building remain stable across encoding layers (see Fig. 4) and consecutive frames (see suppl.). With these high-confidence deformed points in a spatiotemporally stable configuration, our MvACon may induce a local object-context-aware coordinate system that helps overall performance, especially the estimation of velocity and orientation, as we quantitatively observed in experiments. See text for details.
  • Figure 2: Overview of a query-based MV3D object detection pipeline with our proposed MvACon. MvACon is a plug-and-play module for two state-of-the-art query-based MV3D object detection paradigms (exemplified by PETR and BEVFormer, respectively). It computes attentively contextualized features that facilitate better 2D-to-3D feature lifting in both paradigms. See text for details.
  • Figure 3: Visualization of the learned cluster contexts in our MvACon on the nuScenes validation set. We sum all the learned clusters along the channel dimension and upsample the result to the original image resolution through bilinear interpolation. We observe that the learned cluster contexts encode abundant context information about the scene. We provide details with raw images in the supplementary.
  • Figure 4: Visualization of the deformable points originating from a 2D reference point, which is projected from a 3D BEV anchor point in the BEVFormer encoder, on the nuScenes validation set. We use the same BEV anchor point as in Fig. 1. From left to right and top to bottom, we display the deformable points output by each encoder layer (#1-#6), respectively.
  • Figure 5: Qualitative comparisons between BEVFormer and our MvACon method on the nuScenes validation set.
  • ...and 7 more figures
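
The visualization procedure described in the Figure 3 caption (sum over channels, then bilinear upsampling to image resolution) can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function names, the nested-list `(channels, h, w)` layout, and the corner-aligned interpolation convention are assumptions, and a real pipeline would more likely use a tensor library's resize routine.

```python
def sum_over_channels(feat):
    """Collapse a (channels, h, w) nested list into an (h, w) map by summing channels."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[sum(channel[i][j] for channel in feat) for j in range(w)] for i in range(h)]


def bilinear_upsample(grid, out_h, out_w):
    """Resize an (h, w) map to (out_h, out_w) with bilinear interpolation.

    Assumes a corner-aligned convention: the four corner values are preserved.
    """
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output row index back to a fractional input row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for j in range(out_w):
            # Map the output column index back to a fractional input column.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Hypothetical 2-channel, 2x2 cluster-context tensor used only as toy input.
cluster_context = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
summed = sum_over_channels(cluster_context)
heatmap = bilinear_upsample(summed, 3, 3)
for line in heatmap:
    print(line)
```

The resulting map can then be overlaid on the input image as a heatmap; the colormap and any normalization are left out here because the caption does not specify them.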