Vision-Driven 2D Supervised Fine-Tuning Framework for Bird's Eye View Perception

Lei He, Qiaoyi Wang, Honglin Sun, Qing Xu, Bolin Gao, Shengbo Eben Li, Jianqiang Wang, Keqiang Li

TL;DR

This work proposes a method for fine-tuning BEV perception networks using visual 2D semantic perception, aiming to improve the model's generalization to data from new scenes and to significantly reduce the dependence on expensive BEV ground truth.

Abstract

Visual bird's eye view (BEV) perception, owing to its strong perceptual capabilities, is progressively replacing costly LiDAR-based perception systems, especially in urban intelligent driving. However, it still relies on LiDAR data to construct ground-truth databases, a process that is both cumbersome and time-consuming. Moreover, most mass-produced autonomous driving systems are equipped only with surround-view cameras and lack the LiDAR data needed for precise annotation. To tackle this challenge, we propose a method for fine-tuning BEV perception networks using visual 2D semantic perception, aimed at enhancing the model's generalization to data from new scenes. Building on the maturity of 2D perception technologies, our method significantly reduces the dependency on high-cost BEV ground truth and shows promising prospects for industrial application. Extensive experiments and comparative analyses on the nuScenes and Waymo public datasets demonstrate the effectiveness of the proposed method.

Paper Structure (19 sections, 9 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: (a) shows the traditional 3D training framework, which relies on LiDAR and 3D-annotated data. (b) illustrates our proposed 2D-supervised framework, which requires only multi-view images acquired with purely visual sensors and uses 2D annotations to supervise the training of the 3D model, ultimately achieving outstanding performance.
  • Figure 2: The pipeline of our proposed 2D-supervised fine-tuning model. The BEV model infers 3D perception results, which are projected onto the planes of the surround-view images and aligned with manually annotated 2D ground-truth labels. This alignment is used to construct a loss function for fine-tuning the BEV model parameters. In addition, depth information is generated offline to help supervise the matching process, improving the accuracy of 3D-2D matching.
  • Figure 3: Visualization of fine-tuning results on the Waymo dataset. We display the 3D predictions from five different viewpoint images. The first row shows the predictions from the pre-trained model, while the second row presents the predictions after fine-tuning.
  • Figure 4: Visualization of fine-tuning results on the nuScenes dataset. The first three columns display 3D predictions from six different viewpoint images, while the fourth column presents the pre-training and fine-tuning detection results from the BEV perspective for the corresponding scene. The first two rows show predictions from the pre-trained model, and the last two rows present predictions from the fine-tuned model.
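
The projection-and-alignment step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple pinhole camera, an axis-aligned 3D box already expressed in the camera frame (no yaw), and a plain `1 - IoU` loss against a single 2D ground-truth box. The function names (`box_corners`, `projected_bbox`, `iou_2d`, `projection_loss`) and all parameter values are hypothetical, and the offline depth cue used for matching in the paper is not modeled here.

```python
from itertools import product


def box_corners(center, size):
    """Eight corners of an axis-aligned 3D box in the camera frame.

    Assumption: x points right, y down, z forward; yaw is ignored.
    """
    cx, cy, cz = center
    w, h, l = size
    return [
        (cx + sx * w / 2, cy + sy * h / 2, cz + sz * l / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]


def projected_bbox(center, size, fx, fy, u0, v0):
    """Project the 3D box with a pinhole model and return the enclosing
    2D box (u_min, v_min, u_max, v_max), or None if any corner lies
    behind the camera."""
    corners = box_corners(center, size)
    if any(z <= 0 for _, _, z in corners):
        return None
    us = [fx * x / z + u0 for x, _, z in corners]
    vs = [fy * y / z + v0 for _, y, z in corners]
    return (min(us), min(vs), max(us), max(vs))


def iou_2d(a, b):
    """Intersection over union of two (u_min, v_min, u_max, v_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def projection_loss(center, size, gt_box, fx, fy, u0, v0):
    """Illustrative 1 - IoU loss between the projected 3D prediction
    and a 2D ground-truth box; a box behind the camera gets loss 1."""
    pred = projected_bbox(center, size, fx, fy, u0, v0)
    if pred is None:
        return 1.0
    return 1.0 - iou_2d(pred, gt_box)


if __name__ == "__main__":
    # Hypothetical intrinsics and a 2 m cube 10 m in front of the camera.
    intr = dict(fx=1000.0, fy=1000.0, u0=640.0, v0=360.0)
    pred_box = projected_bbox((0.0, 0.0, 10.0), (2.0, 2.0, 2.0), **intr)
    print(pred_box)
    print(projection_loss((0.0, 0.0, 10.0), (2.0, 2.0, 2.0), pred_box, **intr))
```

In a real fine-tuning setup the projection would use each camera's calibrated intrinsics and extrinsics, the loss would be written in a differentiable framework so that gradients reach the BEV model parameters, and predictions would first be matched to 2D labels across views.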