BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai

TL;DR

BEVFormer v2 addresses the challenge of adapting modern image backbones to BEV recognition by introducing perspective supervision through an auxiliary perspective head and a two-stage BEV detector. The method couples a dense perspective loss with a BEV loss, enabling end-to-end training and faster convergence, while a revamped temporal encoder and hybrid object queries fuse perspective proposals into the BEV pipeline. Extensive experiments on nuScenes across multiple backbones show the approach generalizes well and achieves state-of-the-art results, notably 63.4% NDS with InternImage-XL without 3D pretraining. This work demonstrates that explicit perspective supervision is a powerful signal for 3D understanding from multi-view imagery and paves the way for leveraging modern image architectures in BEV-based perception.

Abstract

We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth pre-trained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective space supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies focusing on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset. The code shall be released soon.
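
The coupling of the two objectives mentioned in the TL;DR can be sketched as a weighted sum. Here $L_{pers}$ and $L_{bev}$ are the perspective and BEV loss terms named in the Figure 1 caption. The symbol $L_{total}$ and the weights $\lambda_{pers}$ and $\lambda_{bev}$ are notation assumed for illustration, and their values are not given in this summary.

$$L_{total} = \lambda_{bev} \, L_{bev} + \lambda_{pers} \, L_{pers}$$

Under this reading, $L_{pers}$ supervises the backbone directly in image space, $L_{bev}$ supervises the final predictions, and gradients from both terms reach the backbone in a single end-to-end training run.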

Paper Structure (28 sections, 6 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Overall architecture of BEVFormer v2. The image backbone extracts features from the multi-view images. The perspective 3D head makes perspective predictions, which are then encoded as object queries. The BEV head has an encoder-decoder structure. The spatial encoder generates BEV features by aggregating multi-view image features, followed by the temporal encoder, which collects historical BEV features. The decoder takes hybrid object queries as input and makes the final BEV predictions based on the BEV features. The whole model is trained with the loss terms of the two detection heads, $L_{pers}$ and $L_{bev}$.
  • Figure 2: Comparison of perspective supervision (a) and BEV supervision (b). The supervision signals of the perspective detector are dense and act directly on the image features, while those of the BEV detector are sparse and indirect.
  • Figure 3: The decoder of the BEV head in BEVFormer v2. The projected centers of the first-stage proposals are used as per-image reference points (purple ones), and they are combined with per-dataset learned content queries and positional embeddings (blue ones) as hybrid object queries.
  • Figure 4: Visualization of BEVFormer v2 3D object detection predictions.
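
The hybrid object queries described in the Figure 3 caption can be illustrated with a minimal sketch of how the reference points might be assembled. Everything named here is hypothetical and not taken from the authors' code: the function names, the score threshold, the proposal cap, the `pc_range` layout, and the choice to clamp out-of-range centers are all assumptions. The sketch also assumes that first-stage proposal centers are already expressed as 3D points in the ego frame, so that projecting them amounts to dropping the height and normalizing onto the BEV plane.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]          # normalized (u, v) on the BEV plane, in [0, 1]
Point3D = Tuple[float, float, float]   # (x, y, z) in the ego frame, in meters


def project_to_bev(center: Point3D, pc_range: Sequence[float]) -> Point2D:
    """Drop the height and normalize (x, y) into [0, 1] over the BEV extent.

    pc_range is assumed to be (x_min, y_min, x_max, y_max). Centers that fall
    outside the extent are clamped to the border (an assumption of this sketch).
    """
    x_min, y_min, x_max, y_max = pc_range
    x, y, _ = center
    u = (x - x_min) / (x_max - x_min)
    v = (y - y_min) / (y_max - y_min)
    return (min(max(u, 0.0), 1.0), min(max(v, 0.0), 1.0))


def build_hybrid_reference_points(
    proposal_centers: List[Point3D],
    proposal_scores: List[float],
    learned_refs: List[Point2D],
    pc_range: Sequence[float],
    score_thresh: float = 0.3,   # hypothetical filtering threshold
    max_proposals: int = 2,      # hypothetical cap on per-image points
) -> List[Point2D]:
    """Concatenate per-dataset learned points with per-image proposal points."""
    # Keep confident first-stage proposals, highest score first.
    kept = sorted(
        ((s, c) for s, c in zip(proposal_scores, proposal_centers) if s >= score_thresh),
        key=lambda pair: pair[0],
        reverse=True,
    )[:max_proposals]
    per_image_refs = [project_to_bev(c, pc_range) for _, c in kept]
    # Learned points are shared across the dataset; proposal points change per image.
    return list(learned_refs) + per_image_refs


if __name__ == "__main__":
    refs = build_hybrid_reference_points(
        proposal_centers=[(0.0, 0.0, 1.0), (25.6, -25.6, 0.5), (100.0, 0.0, 0.0)],
        proposal_scores=[0.9, 0.2, 0.6],
        learned_refs=[(0.25, 0.25)],
        pc_range=(-51.2, -51.2, 51.2, 51.2),
    )
    print(refs)
```

In this sketch the learned reference points play the role of the per-dataset prior, while the proposal-derived points give the decoder image-specific starting locations. The content queries and positional embeddings that accompany each reference point are omitted for brevity.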