BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai
TL;DR
BEVFormer v2 addresses the challenge of adapting modern image backbones to BEV recognition by introducing perspective supervision through an auxiliary perspective head and a two-stage BEV detector. The method couples a dense perspective loss with a BEV loss, enabling end-to-end training and faster convergence, while a revamped temporal encoder and hybrid object queries fuse perspective proposals into the BEV pipeline. Extensive experiments on nuScenes across multiple backbones show the approach generalizes well and achieves state-of-the-art results, notably 63.4% NDS with InternImage-XL without 3D pretraining. This work demonstrates that explicit perspective supervision is a powerful signal for 3D understanding from multi-view imagery and paves the way for leveraging modern image architectures in BEV-based perception.
Abstract
We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth-pretrained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective-space supervision. Specifically, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies focusing on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset. The code will be released soon.
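
The two-stage design and the coupled supervision described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, which had not been released at the time of writing. It assumes that the perspective head's proposals have already been projected to BEV-plane centers, that the BEV head's queries are formed by concatenating learned reference points with the top-scoring proposal centers, and that the two losses are combined as a weighted sum. The names `Proposal`, `hybrid_reference_points`, `total_loss`, `top_k`, `w_bev`, and `w_pers` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) location on the BEV plane, in meters


@dataclass
class Proposal:
    """A first-stage detection from the perspective head (hypothetical type)."""
    center_bev: Point  # box center, assumed already projected onto the BEV plane
    score: float       # confidence assigned by the perspective head


def hybrid_reference_points(
    learned: List[Point], proposals: List[Proposal], top_k: int
) -> List[Point]:
    """Concatenate learned reference points with the top_k proposal centers.

    Assumption: proposals are ranked by score and only the best top_k are kept.
    """
    ranked = sorted(proposals, key=lambda p: p.score, reverse=True)
    return list(learned) + [p.center_bev for p in ranked[:top_k]]


def total_loss(
    loss_bev: float, loss_pers: float, w_bev: float = 1.0, w_pers: float = 1.0
) -> float:
    """Weighted sum of the BEV loss and the perspective loss (assumed form)."""
    return w_bev * loss_bev + w_pers * loss_pers


if __name__ == "__main__":
    learned = [(0.0, 0.0), (10.0, -5.0)]
    proposals = [
        Proposal((3.0, 4.0), 0.9),
        Proposal((-2.0, 1.0), 0.2),
        Proposal((7.0, 7.0), 0.6),
    ]
    # Two learned points followed by the two highest-scoring proposal centers.
    print(hybrid_reference_points(learned, proposals, top_k=2))
    # Both heads contribute to a single training objective.
    print(total_loss(loss_bev=2.0, loss_pers=0.5))
```

Because both terms enter one objective, gradients from the perspective head reach the image backbone directly. This is the mechanism the abstract credits for easier optimization, although the weighting and the proposal selection rule shown here are placeholders.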
