Learning High-resolution Vector Representation from Multi-Camera Images for 3D Object Detection

Zhili Chen; Shuangjie Xu; Maosheng Ye; Zian Qian; Xiaoyi Zou; Dit-Yan Yeung; Qifeng Chen

TL;DR

This paper tackles the cost of scaling BEV-based 3D object detection to higher spatial resolution. It introduces VectorFormer, a camera-based detector that uses a high-resolution vector representation factorized into two low-rank components along the x- and y-axes, enabling fine-grained BEV modeling with linear $O(n)$ complexity rather than quadratic scaling. Central to the approach are the Vector Query Scattering and Vector Query Gathering modules: scattering generates sparse high-resolution BEV queries and fuses them with low-resolution BEV features, the fused queries interact with multi-view image features through deformable attention in a shared LR-HR fusion mechanism, and gathering factorizes the result back into the vector queries. Temporal integration and the use of the learned vector queries as decoding queries further improve prediction accuracy. Experiments on nuScenes and Waymo show state-of-the-art performance and improved efficiency, demonstrating the method’s potential to advance high-fidelity, real-time 3D perception from multi-view cameras.
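The linear-versus-quadratic claim can be made concrete by counting query entries. The sketch below is not taken from the paper: the function names, the channel width of 256, and the resolutions 150 and 300 are illustrative assumptions. It only contrasts a dense $n \times n$ query grid with two axis-aligned vectors of length $n$.

```python
def dense_bev_entries(n: int, channels: int) -> int:
    """Entries in a dense n x n BEV query grid (quadratic in n)."""
    return n * n * channels


def vector_query_entries(n: int, channels: int) -> int:
    """Entries in two axis-aligned vector queries of length n (linear in n)."""
    return 2 * n * channels


if __name__ == "__main__":
    channels = 256  # assumed embedding width, for illustration only
    for n in (150, 300, 450):  # 150 and 300 are assumed; 450 appears in Figure 1
        dense = dense_bev_entries(n, channels)
        vec = vector_query_entries(n, channels)
        print(f"n={n}: dense={dense}, vector={vec}, ratio={dense / vec:.0f}x")
```

Under these assumptions the ratio between the two counts is $n/2$, so the gap widens as the resolution increases.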

Abstract

The Bird's-Eye-View (BEV) representation is a critical factor that directly impacts 3D object detection performance, but the traditional BEV grid representation incurs quadratic computational cost as the spatial resolution grows. To address this limitation, we present VectorFormer, a new camera-based 3D object detector with a high-resolution vector representation. This representation is combined with the lower-resolution BEV representation to efficiently exploit 3D geometry from multi-camera images at high resolution through two novel modules: vector scattering and vector gathering. The learned vector representation, which carries richer scene context, then serves as the decoding query for final predictions. We conduct extensive experiments on the nuScenes dataset and demonstrate state-of-the-art performance in NDS and inference time. Furthermore, we incorporate the proposed vector representation into other query-BEV-based methods and observe a consistent performance improvement.

Paper Structure (15 sections, 8 equations, 4 figures, 9 tables)

Introduction
Related Work
  Sparse Query 3D detector
  Dense BEV Feature 3D Detector
Method
  Vector Query
  Vector Query Scattering
  Joint LR-HR Spatial Features Extraction
  Vector Query Gathering
  Model Architecture Details
Experiments
  Experimental Setup
  Main Results
  Ablation Study
Conclusion

Figures (4)

Figure 1: Comparison between the typical query-BEV architecture (BEVFormer, Li et al., 2022) and our proposed VectorFormer with the novel vector query. Compared with the traditional design, our vector queries are encoded with finer-grained scene contexts, which the decoder transforms into more accurate 3D predictions. The remaining panels compare effectiveness (NDS) and efficiency. Compared with the traditional architecture, the performance of VectorFormer keeps improving as the representation resolution scales up, without incurring long inference time or high memory cost. Note that BEVFormer runs out of memory (OOM) when trained at a BEV resolution of $450\times 450$ on NVIDIA A100 40GB GPUs.
Figure 2: The overall framework of our proposed VectorFormer. In the middle, the encoder takes the BEV query and vector query at the current timestamp, together with those from the previous timestamp, as inputs for interacting with multi-view image features at each layer. The decoder then transforms the learned representations into 3D predictions. We highlight our innovations for learning the high-resolution vector representation within the encoder and zoom in on the designs of Vector Query Scattering (see Figure 3 for details) and Vector Query Gathering on the right. In the Vector Query Scattering module, we represent the sparse high-resolution (HR) BEV features by first recognizing the foreground regions and then compositing them with the factorized vector queries $\mathbf{V^X}$ and $\mathbf{V^Y}$. After jointly interacting with the image features, we finally factorize the learned HR BEV features back into $\mathbf{V^X}$ and $\mathbf{V^Y}$ through the Vector Query Gathering module.
Figure 3: Design details of the Vector Query Scattering module of our proposed VectorFormer. We recognize the foreground regions by first predicting an objectness heatmap from the high-resolution (HR) vector queries $\mathbf{V^X}$ and $\mathbf{V^Y}$ and the low-resolution (LR) BEV queries, and then locating the foreground regions by taking the directional Top-$k$ on the heatmap ($k=1$ here for demonstration). Next, we apply deformable offsetting so that the model learns to cover more informative regions. Finally, we construct the sparse HR BEV by fusing the factorized vector queries with the grid-sampled LR BEV features for these informative regions.
Figure 4: Visualization results of VectorFormer on the nuScenes (Caesar et al., 2020) validation set. Detection predictions with ground truth are shown in multi-view camera images on the left and in bird's-eye view on the right.
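
The scattering and gathering steps described in the captions of Figures 2 and 3 can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. "Directional Top-$k$" is read here as top-$k$ along each BEV axis separately, the composition of $\mathbf{V^X}$ and $\mathbf{V^Y}$ is assumed to be additive, and gathering is assumed to be a per-axis mean. The function names are hypothetical. The objectness head, deformable offsetting, LR BEV fusion, and attention are omitted.

```python
import numpy as np


def directional_topk(heatmap: np.ndarray, k: int = 1) -> np.ndarray:
    """Unique (i, j) cells that are top-k within their row or their column.

    Assumption: "directional" means along each BEV axis separately.
    """
    n_rows, n_cols = heatmap.shape
    # Top-k columns within every row.
    cols = np.argsort(-heatmap, axis=1)[:, :k]            # (n_rows, k)
    rows = np.repeat(np.arange(n_rows), k)
    row_wise = np.stack([rows, cols.reshape(-1)], axis=1)
    # Top-k rows within every column.
    rows_c = np.argsort(-heatmap, axis=0)[:k, :]           # (k, n_cols)
    cols_c = np.tile(np.arange(n_cols), k)
    col_wise = np.stack([rows_c.reshape(-1), cols_c], axis=1)
    return np.unique(np.concatenate([row_wise, col_wise]), axis=0)


def scatter(v_x: np.ndarray, v_y: np.ndarray, idx: np.ndarray) -> np.ndarray:
    """Compose sparse HR queries at the selected cells (assumed additive)."""
    return v_x[idx[:, 0]] + v_y[idx[:, 1]]


def gather(hr_feats: np.ndarray, idx: np.ndarray, n: int):
    """Factorize sparse HR features back onto each axis (assumed mean)."""
    c = hr_feats.shape[1]
    v_x = np.zeros((n, c))
    v_y = np.zeros((n, c))
    np.add.at(v_x, idx[:, 0], hr_feats)   # sum features sharing an x index
    np.add.at(v_y, idx[:, 1], hr_feats)   # sum features sharing a y index
    cnt_x = np.bincount(idx[:, 0], minlength=n)[:, None]
    cnt_y = np.bincount(idx[:, 1], minlength=n)[:, None]
    return v_x / np.maximum(cnt_x, 1), v_y / np.maximum(cnt_y, 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, c = 8, 4                            # toy resolution and channel width
    heatmap = rng.random((n, n))           # stand-in for the predicted heatmap
    v_x, v_y = rng.normal(size=(n, c)), rng.normal(size=(n, c))
    idx = directional_topk(heatmap, k=1)
    hr = scatter(v_x, v_y, idx)            # sparse HR BEV queries
    new_v_x, new_v_y = gather(hr, idx, n)
    print(idx.shape[0], "cells selected out of", n * n)
```

With $k=1$, at most $2n$ cells are selected out of $n^2$, which is why the sparse HR BEV stays linear in the resolution in this sketch.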
