Multi-View Attentive Contextualization for Multi-View 3D Object Detection
Xianpeng Liu, Ce Zheng, Ming Qian, Nan Xue, Chen Chen, Zhebin Zhang, Chen Li, Tianfu Wu
TL;DR
The paper tackles 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detectors, where existing approaches struggle to exploit high-resolution 2D context while remaining computationally efficient. It introduces Multi-View Attentive Contextualization (MvACon), a cluster-attentive module based on Patch-to-Cluster attention that enriches 2D features with global scene context before 2D-to-3D lifting and is designed to be plug-and-play for both decoder-only and encoder-decoder MV3D architectures. By clustering at multiple scales across feature pyramids, MvACon provides denser, more semantically meaningful 3D awareness, improving location, orientation, and velocity estimation on nuScenes with PETR, BEVFormer, and DFA3D baselines, and on Waymo-mini with BEVFormer. The results indicate that contextualized features materially improve MV3D performance, supporting the adage that contextualized representations matter in 3D scene understanding and offering a practical path to stronger camera-based 3D detectors.
Abstract
We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress in query-based MV3D object detection, prior art often either fails to exploit high-resolution 2D features in dense attention-based lifting because of the high computational cost, or grounds 3D queries too sparsely in multi-scale 2D features in sparse attention-based lifting. Our proposed MvACon addresses both problems at once with a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to the specific 2D-to-3D feature lifting approach. In experiments, MvACon is thoroughly tested on the nuScenes benchmark using BEVFormer, its recent 3D deformable attention (DFA3D) variant, and PETR, and shows consistent improvement in detection performance, especially in location, orientation, and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer, with similar improvement. We show qualitatively and quantitatively that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of our proposed MvACon reinforce the adage in computer vision: "(contextualized) feature matters".
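To illustrate what "representationally dense yet computationally sparse" can mean in practice, the following is a minimal NumPy sketch of a generic patch-to-cluster attention step. It is not the authors' implementation: the function name, the random projection matrices, the softmax-over-patches clustering, and the sizes N, C, M, D are all assumptions chosen for illustration. Every one of the N patch features attends to M learned cluster summaries of the whole feature map, so the attention matrix is N x M rather than N x N.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_to_cluster_attention(patches, w_cluster, w_q, w_k, w_v):
    """Hypothetical single-head patch-to-cluster attention.

    patches:   (N, C) flattened 2D feature map
    w_cluster: (C, M) projection producing M cluster logits per patch
    w_q/k/v:   (C, D) query / key / value projections
    """
    # Soft-assign patches to clusters; normalizing over patches (axis=0)
    # makes each cluster a convex combination of all patch features.
    assign = softmax(patches @ w_cluster, axis=0)      # (N, M)
    clusters = assign.T @ patches                      # (M, C)

    q = patches @ w_q                                  # (N, D)
    k = clusters @ w_k                                 # (M, D)
    v = clusters @ w_v                                 # (M, D)

    # Each patch attends to the M clusters: cost O(N * M), not O(N^2).
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    return attn @ v, attn

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
N, C, M, D = 1024, 32, 16, 32
patches = rng.standard_normal((N, C))
w_cluster = rng.standard_normal((C, M)) * 0.1
w_q = rng.standard_normal((C, D)) * 0.1
w_k = rng.standard_normal((C, D)) * 0.1
w_v = rng.standard_normal((C, D)) * 0.1

contextualized, attn = patch_to_cluster_attention(
    patches, w_cluster, w_q, w_k, w_v
)
print(contextualized.shape, attn.shape)
```

In this sketch every patch still receives a context vector (dense in representation), while the attention matrix has N x M entries instead of N x N (sparse in computation). A contextualized feature map of this kind would then be passed to whichever 2D-to-3D lifting scheme the detector uses.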
