Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline

Yangguang Li, Bin Huang, Zeren Chen, Yufeng Cui, Feng Liang, Mingzhu Shen, Fenggang Liu, Enze Xie, Lu Sheng, Wanli Ouyang, Jing Shao

TL;DR

Fast-BEV addresses the trade-off between accuracy and deployment efficiency in BEV-based autonomous perception by eliminating reliance on costly view transformers and depth prediction. It introduces a deployment-friendly Fast-Ray transformation with a Look-Up-Table and Multi-View to One-Voxel scheme, coupled with a multi-scale image encoder, an efficient BEV encoder, data augmentation, and temporal fusion. On nuScenes, it achieves competitive mAP and NDS while delivering real-time performance on on-vehicle chips, and it provides a practical CPU-friendly deployment benchmark. The work offers a simple, strong baseline for industry-ready, task-agnostic BEV perception on edge hardware and motivates further multimodal and multitask extensions.

Abstract

Recently, perception tasks based on the Bird's-Eye View (BEV) representation have drawn increasing attention, and the BEV representation is a promising foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources to execute on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing faster BEV perception on on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without an expensive transformer-based transformation or depth representation. Fast-BEV consists of five parts. We propose (1) a lightweight, deployment-friendly view transformation that quickly transfers 2D image features to 3D voxel space, (2) a multi-scale image encoder that leverages multi-scale information for better performance, and (3) an efficient BEV encoder designed specifically to speed up on-vehicle inference. We further introduce (4) a strong data augmentation strategy for both image and BEV space to avoid over-fitting and (5) a multi-frame feature fusion mechanism to leverage temporal information. In experiments on the 2080Ti platform, our R50 model runs at 52.6 FPS with 47.3% NDS on the nuScenes validation set, compared with 41.3 FPS and 47.5% NDS for the BEVDepth-R50 model and 30.2 FPS and 45.7% NDS for the BEVDet4D-R50 model. Our largest model (R101@900x1600) establishes a competitive 53.5% NDS on the nuScenes validation set. We further develop a benchmark with considerable accuracy and efficiency on current popular on-vehicle chips. The code is released at: https://github.com/Sense-GVT/Fast-BEV.

Paper Structure

This paper contains 20 sections, 3 equations, 8 figures, 15 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Comparison of view transformation methods. (a) Query-Based Transformation: methods that use the transformer attention mechanism. (b) Depth-Based Transformation: methods that predict a depth distribution. (c) Fast-Ray Transformation (Ours): a uniform depth distribution along the camera ray, with Look-Up-Table and Multi-View to One-Voxel operations.
  • Figure 2: Overview of Fast-BEV. It consists of: ① Fast-Ray Transformation, which pre-computes the image-to-voxel index (Look-Up-Table) and lets all cameras project to the same dense voxel (Multi-View to One-Voxel) to reduce projection time, ② Multi-Scale Image Encoder with Multi-Scale Projection to obtain multi-scale features, ③ Efficient BEV Encoder with an efficient design to reduce inference time, ④ Data Augmentation in the image and BEV domains to avoid over-fitting and achieve better performance, ⑤ Temporal Fusion module in the BEV encoder stage to leverage multi-frame information.
  • Figure 3: Bird's-eye view of which discrete voxels are filled. (a) In the basic view transformation, each camera has its own sparse voxel (only $\sim$17% of positions are non-zero), and an expensive aggregation operation is needed to combine the sparse voxels. (b) The proposed Fast-BEV lets all cameras project to one dense voxel, avoiding the expensive voxel aggregation.
  • Figure 4: The multi-scale image encoder extracts multi-level features from multi-view images. The input is $N$ images, each in $R^{H \times W \times 3}$, and the output is three levels of features, $F_{1/4}, F_{1/8}, F_{1/16}$.
  • Figure 5: Examples of the data augmentation used in Fast-BEV. The middle figure shows the sample without data augmentation. The left figure shows image augmentation, with augmentation types such as random flip, crop, and rotate. The right figure shows one type of BEV augmentation, random rotation.
  • ...and 3 more figures
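
The Look-Up-Table and Multi-View to One-Voxel ideas named in the captions of Figures 1 to 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `build_lut` and `fast_ray_transform`, the flat `(V, 3)` table layout, the nearest-pixel (floor) sampling, and the rule that the first camera wins where views overlap are all assumptions made for illustration. The sketch only shows the general pattern: the voxel-to-pixel index depends on camera calibration alone, so it can be computed once offline, and at inference each voxel copies a single image feature into one shared voxel grid.

```python
import numpy as np


def build_lut(projections, voxel_centers, feat_hw):
    """Precompute, for every voxel, the (camera, row, col) it samples from.

    projections:   (N, 3, 4) matrices mapping homogeneous 3D points to
                   feature-map pixel coordinates (hypothetical layout).
    voxel_centers: (V, 3) float array of voxel centres.
    feat_hw:       (H, W) size of each camera's feature map.
    Returns an int array of shape (V, 3); rows are (cam, row, col), and
    voxels that no camera sees keep the value -1.
    """
    H, W = feat_hw
    V = voxel_centers.shape[0]
    homo = np.concatenate([voxel_centers, np.ones((V, 1))], axis=1)  # (V, 4)
    lut = np.full((V, 3), -1, dtype=np.int64)
    for cam, P in enumerate(projections):
        uvw = homo @ P.T                      # (V, 3) projective coordinates
        depth = uvw[:, 2]
        in_front = depth > 1e-6               # ignore points behind the camera
        safe_depth = np.where(in_front, depth, 1.0)
        u = np.floor(uvw[:, 0] / safe_depth).astype(np.int64)
        v = np.floor(uvw[:, 1] / safe_depth).astype(np.int64)
        visible = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        # Assumption: in overlapping regions the first camera that sees a
        # voxel claims it, so every voxel has at most one source pixel.
        take = visible & (lut[:, 0] < 0)
        lut[take, 0] = cam
        lut[take, 1] = v[take]
        lut[take, 2] = u[take]
    return lut


def fast_ray_transform(feats, lut):
    """Fill one shared voxel grid from all cameras using the table.

    feats: (N, H, W, C) image features from N cameras.
    lut:   (V, 3) table produced by build_lut.
    Returns (V, C) voxel features; unseen voxels stay zero.
    """
    out = np.zeros((lut.shape[0], feats.shape[-1]), dtype=feats.dtype)
    valid = lut[:, 0] >= 0
    cam, v, u = lut[valid].T
    out[valid] = feats[cam, v, u]             # one gather, no per-camera grids
    return out
```

Because every voxel along a camera ray maps to the same pixel, all of them receive the same feature, which corresponds to the uniform depth distribution described in Figure 1(c). Writing into a single output array also avoids building one sparse grid per camera and aggregating them afterwards, which is the cost that Figure 3 contrasts against.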