Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection
Zhili Chen, Shuangjie Xu, Maosheng Ye, Zian Qian, Xiaoyi Zou, Dit-Yan Yeung, Qifeng Chen
TL;DR
This paper tackles the cost of scaling BEV-based 3D object detection to higher spatial resolutions. It introduces VectorFormer, a camera-based detector that uses a high-resolution vector representation factorized into two low-rank components along the x- and y-axes, so that fine-grained BEV modeling scales linearly, $O(n)$, with the per-axis resolution $n$ rather than quadratically. Central to the approach are the Vector Query Scattering and Vector Query Gathering modules, which generate sparse high-resolution BEV queries, fuse them with low-resolution BEV features, and interact with multi-view image features through deformable attention and a shared LR-HR fusion mechanism. Temporal integration and the reuse of the learned vector queries as decoding queries further improve prediction accuracy. Experiments on nuScenes show state-of-the-art NDS and inference time, and adding the vector representation to other query-BEV-based detectors yields consistent gains.
Abstract
The Bird's-Eye-View (BEV) representation is a critical factor that directly impacts 3D object detection performance, but the traditional BEV grid representation incurs a computational cost that grows quadratically with the spatial resolution. To address this limitation, we present VectorFormer, a new camera-based 3D object detector with a high-resolution vector representation. This vector representation is combined with a lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at high resolution through two novel modules: vector scattering and vector gathering. The learned vector representation, which carries richer scene context, then serves as the decoding query for the final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we integrate the proposed vector representation into other query-BEV-based methods and observe a consistent performance improvement.
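To make the quadratic-versus-linear contrast above concrete, the following minimal sketch counts queries for a dense n x n BEV grid and for two length-n vectors, one per BEV axis. It is only a counting illustration, not the VectorFormer implementation: the function names and the example resolutions are hypothetical, and the sketch ignores the feature dimension, the low-resolution BEV branch, and the scattering and gathering modules.

```python
def dense_bev_queries(n: int) -> int:
    """Number of queries in a dense n x n BEV grid (grows quadratically in n)."""
    return n * n


def axis_vector_queries(n: int) -> int:
    """Number of queries in two length-n vectors, one along x and one along y."""
    return 2 * n


def reduction_factor(n: int) -> float:
    """How many times fewer queries the two axis vectors need than the grid."""
    return dense_bev_queries(n) / axis_vector_queries(n)


if __name__ == "__main__":
    # Hypothetical per-axis resolutions, chosen only for illustration.
    for n in (100, 200, 400, 800):
        print(n, dense_bev_queries(n), axis_vector_queries(n), reduction_factor(n))
```

Doubling n quadruples the dense grid count but only doubles the vector count, which is why a vector representation can afford a much finer resolution under the same query budget.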
