Vision-Centric BEV Perception: A Survey

Yuexin Ma, Tai Wang, Xuyang Bai, Huitong Yang, Yuenan Hou, Yaming Wang, Yu Qiao, Ruigang Yang, Dinesh Manocha, Xinge Zhu

TL;DR

This survey analyzes the evolution of vision-centric BEV perception. It details four main paradigms for perspective-view (PV) to BEV transformation (homography-based, depth-based, MLP-based, and transformer-based) and their downstream tasks, such as 3D object detection and BEV segmentation. It contrasts traditional inverse perspective mapping (IPM) with learning-based depth lifting, highlights the rise of transformer-based view projectors (sparse, dense, and hybrid queries), and discusses extensions such as multi-task learning, BEV fusion, and semantic occupancy prediction. The work synthesizes datasets, metrics, and empirical know-how, offering a catalog of methods, performance trends, and practical guidelines to accelerate future research and deployment. It also points to community resources and datasets to foster reproducibility and cross-domain evaluation in autonomous driving scenarios.

Abstract

In recent years, vision-centric Bird's Eye View (BEV) perception has garnered significant interest from both industry and academia due to its inherent advantages, such as providing an intuitive representation of the world and being conducive to data fusion. The rapid advancements in deep learning have led to the proposal of numerous methods for addressing vision-centric BEV perception challenges. However, there has been no recent survey encompassing this novel and burgeoning research field. To catalyze future research, this paper presents a comprehensive survey of the latest developments in vision-centric BEV perception and its extensions. It compiles and organizes up-to-date knowledge, offering a systematic review and summary of prevalent algorithms. Additionally, the paper provides in-depth analyses and comparative results on various BEV perception tasks, facilitating the evaluation of future works and sparking new research directions. Furthermore, the paper discusses and shares valuable empirical implementation details to aid in the advancement of related algorithms.

Paper Structure (40 sections, 14 figures, 10 tables)

Figures (14)

  • Figure 1: A taxonomy of algorithms for transforming perspective view to bird's eye view. We categorize the methods for view transformation into four streams, following the development from non-deep approaches relying on geometry to deep ones involving learning. To clarify this development process and the differences among these streams, we devote a separate sub-section to each stream and summarize how later methods integrate earlier ideas.
  • Figure 2: Chronological overview of homography-based PV-to-BEV methods.
  • Figure 3: Chronological overview of depth-based PV-to-BEV methods.
  • Figure 4: Point-based methods transform 2D image pixels to Pseudo-LiDAR and use LiDAR-based approaches for 3D object detection.
  • Figure 5: Comparison of the depth distributions of LSS and OFT.
  • ...and 9 more figures