
Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiangwei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong Jia, Si Liu, Jianping Shi, Dahua Lin, Yu Qiao

TL;DR

The paper surveys bird's-eye-view (BEV) perception for autonomous driving, framing BEV as a unified, fusion-friendly representation that facilitates downstream planning. It dissects BEV pipelines across camera-only, LiDAR-only, and multi-sensor fusion architectures, detailing view transformation, BEV encoders, and prediction heads, while noting industry practices and practical recipes. Core challenges identified include accurate depth estimation for camera BEV, robust cross-modal fusion, and generalization across diverse sensor configurations, with proposed directions toward transformer-based fusion, temporal BEV, and foundation-model integration. The work provides a pragmatic set of data augmentations, encoder designs, loss strategies, and ensemble/post-processing techniques to boost BEV performance on major benchmarks, offering a valuable reference for researchers and practitioners alike.

Abstract

Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention from both industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important. BEV perception has several advantages: representing surrounding scenes in BEV is intuitive and fusion-friendly, and representing objects in BEV is most desirable for subsequent modules such as planning and/or control. The core problems for BEV perception lie in (a) how to reconstruct the lost 3D information via view transformation from perspective view to BEV; (b) how to acquire ground truth annotations in the BEV grid; (c) how to formulate the pipeline to incorporate features from different sources and views; and (d) how to adapt and generalize algorithms as sensor configurations vary across different scenarios. In this survey, we review the most recent works on BEV perception and provide an in-depth analysis of different solutions. Moreover, several systematic designs of BEV approaches from industry are described as well. Furthermore, we introduce a full practical guidebook to improve the performance of BEV perception tasks, covering camera, LiDAR and fusion inputs. Finally, we point out future research directions in this area. We hope this report will shed some light for the community and encourage more research effort on BEV perception. We keep an active repository to collect the most recent work and provide a toolbox for bag of tricks at https://github.com/OpenDriveLab/Birds-eye-view-Perception

Paper Structure

This paper contains 69 sections, 15 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The general picture of BEV perception at a glance, which consists of three sub-parts based on the input modality. BEV perception is a general task built on top of a series of fundamental tasks. To give a more complete view of the perception algorithms in autonomous driving, we list other topics (e.g., Foundation Model) as well.
  • Figure 2: The general pipeline of BEV Camera (camera-only perception). There are three parts: a 2D feature extractor, view transformation, and a 3D decoder. In view transformation, there are two ways to encode 3D information: one is to predict depth information from 2D features; the other is to sample 2D features from 3D space.
  • Figure 3: Taxonomy of View Transformation. Among the 2D-3D methods, LSS-based approaches [philion2020lift, reading2021categorical, huang2021bevdet, huang2022bevdet4d, li2022bevdepth, liu2022bevfusion, liang2022bevfusion] predict a per-pixel depth distribution from 2D features. Among the 3D-2D methods, homography-matrix-based methods [li2022bevformer, chen2022persformer, gong2022gitnet] presume sparse 3D sample points and project them to the 2D plane via camera parameters. Pure-network-based methods [pan2020cross, hendy2020fishing, chitta2021neat, yang2021projecting, gosala2022bird] adopt an MLP or transformer to implicitly model the projection from 3D space to the 2D plane.
  • Figure 4: The general pipeline of BEV LiDAR perception. There are mainly two branches to convert point cloud data into BEV representation. The upper branch extracts point cloud features in 3D space, providing more accurate detection results. The lower branch extracts BEV features in 2D space, providing more efficient networks.
  • Figure 5: Two typical pipeline designs for BEV fusion algorithms, applicable to both academia and industry. The main difference lies in the 2D-to-3D conversion and fusion modules. In the PV perception pipeline (a), the results of different algorithms are first transformed into 3D space, then fused using priors or hand-crafted rules. The BEV perception pipeline (b) first transforms PV features to BEV, then fuses the features to obtain the final predictions, thereby retaining most of the original information and avoiding hand-crafted design.
  • ...and 2 more figures
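
The two view-transformation directions named in the captions for Figures 2 and 3 can be illustrated with a toy sketch. This is not code from the paper or its repository: the function names `lift_pixel_feature` and `project_point`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the ideal pinhole camera without lens distortion are all assumptions made for illustration. The first function spreads one pixel's feature over depth bins according to a predicted depth distribution (the 2D-3D direction); the second projects a 3D sample point into the image so that a 2D feature could be sampled there (the 3D-2D direction).

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def lift_pixel_feature(depth_probs: List[float], feature: List[float]) -> List[List[float]]:
    """2D-3D direction: spread one pixel's feature along its camera ray.

    Returns a D x C table whose entry [d][c] is feature[c] weighted by the
    predicted probability that the pixel falls into depth bin d.
    (Hypothetical helper; real models do this for every pixel at once.)
    """
    return [[p * f for f in feature] for p in depth_probs]


def project_point(
    point_cam: Vec3, fx: float, fy: float, cx: float, cy: float
) -> Optional[Tuple[float, float]]:
    """3D-2D direction: project a 3D sample point to pixel coordinates.

    The point is given in the camera frame (z pointing forward). An ideal
    pinhole model is assumed. Points at or behind the camera centre cannot
    be sampled, so None is returned for them.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)
```

For example, `project_point((1.0, 0.5, 10.0), fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)` maps a point 10 m ahead of the camera to the pixel `(900.0, 500.0)`, while a point with negative `z` yields `None`. In a full pipeline the point would first be moved from the ego or BEV frame into the camera frame using the extrinsic parameters, a step omitted here.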