FastBEV++: Fast by Algorithm, Deployable by Design

Yuanpeng Chen; Hui Song; Wei Tao; ShanHui Mo; Shuang Zhang; Xiao Hua; TianKun Zhao

TL;DR

To address the trade-off between accuracy and deployment feasibility in camera-only BEV perception, the authors propose FastBEV++, which reframes the view transformation as an Index-Gather-Reshape pipeline and uses deterministic pre-sorting to obtain a hardware-agnostic, plugin-free implementation. The method also integrates depth-aware fusion into the gather stage, improving geometric fidelity without resorting to heavy attention or voxel-based pooling. Experiments on nuScenes show a state-of-the-art NDS of 0.359 together with real-time performance on automotive hardware, including 134 FPS on a Tesla T4 with INT8. The work demonstrates that deployment constraints can catalyze stronger perceptual models and offers a scalable blueprint for production autonomous systems.

Abstract

The advancement of camera-only Bird's-Eye-View (BEV) perception is currently impeded by a fundamental tension between state-of-the-art performance and on-vehicle deployment tractability. This bottleneck stems from a deep-rooted dependency on computationally prohibitive view transformations and bespoke, platform-specific kernels. This paper introduces FastBEV++, a framework engineered to reconcile this tension, demonstrating that high performance and deployment efficiency can be achieved in unison via two guiding principles: Fast by Algorithm and Deployable by Design. We realize the "Deployable by Design" principle through a novel view transformation paradigm that decomposes the monolithic projection into a standard Index-Gather-Reshape pipeline. Enabled by a deterministic pre-sorting strategy, this transformation is executed entirely with elementary, operator-native primitives (e.g., Gather, Matrix Multiplication), which eliminates the need for specialized CUDA kernels and ensures full TensorRT-native portability. Concurrently, our framework is "Fast by Algorithm", leveraging this decomposed structure to seamlessly integrate an end-to-end, depth-aware fusion mechanism. This jointly learned depth modulation, further bolstered by temporal aggregation and robust data augmentation, significantly enhances the geometric fidelity of the BEV representation. Empirical validation on the nuScenes benchmark corroborates the efficacy of our approach. FastBEV++ establishes a new state of the art of 0.359 NDS while maintaining exceptional real-time performance, exceeding 134 FPS on automotive-grade hardware (e.g., Tesla T4). By offering a solution that is free of custom plugins yet highly accurate, FastBEV++ presents a mature and scalable design philosophy for production autonomous systems. The code is released at: https://github.com/ymlab/advanced-fastbev
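
To make the Index-Gather-Reshape idea concrete, the following NumPy sketch shows one way a view transformation can be reduced to a table lookup. It is an illustration under stated assumptions, not the authors' implementation: the function name `index_gather_reshape`, the padding-row convention for voxels that no camera sees, and the per-pixel scalar `depth_weight` standing in for the learned depth modulation are all hypothetical. The index table is assumed to be computed offline from camera calibration, and the sketch does not attempt to reproduce the paper's pre-sorting strategy.

```python
import numpy as np


def index_gather_reshape(feats, index_table, bev_shape, depth_weight=None):
    """Hypothetical sketch of a gather-based view transformation.

    feats:        (P, C) image features, flattened over all cameras and pixels.
    index_table:  (X*Y*Z,) integer table fixed offline; entry v is the pixel
                  feeding voxel v, or P if no pixel sees that voxel.
    bev_shape:    (X, Y, Z) size of the voxel grid.
    depth_weight: optional (P,) per-pixel weight (assumed stand-in for a
                  learned depth modulation).
    Returns a BEV map of shape (X, Y, Z*C) with height folded into channels.
    """
    num_pixels, channels = feats.shape
    x, y, z = bev_shape
    if depth_weight is not None:
        # Modulate features before the lookup so the gather itself stays
        # a plain indexing operation.
        feats = feats * depth_weight[:, None]
    # Append one all-zero row so unseen voxels can point at index P.
    pad = np.zeros((1, channels), dtype=feats.dtype)
    padded = np.concatenate([feats, pad], axis=0)
    # Gather: one row lookup per voxel, no scatter and no custom kernel.
    gathered = np.take(padded, index_table, axis=0)  # (X*Y*Z, C)
    # Reshape: fold the height axis into the channel axis.
    return gathered.reshape(x, y, z * channels)


# Toy usage: 4 pixels with 2 channels, and a 1 x 2 x 2 voxel grid.
feats = np.arange(8, dtype=np.float32).reshape(4, 2)
index_table = np.array([2, 4, 0, 3])  # 4 == num_pixels marks an unseen voxel
bev = index_gather_reshape(feats, index_table, (1, 2, 2))
print(bev.shape)
```

Because the table depends only on calibration and grid geometry, the per-frame work in this sketch is a single gather followed by a reshape, both of which map onto standard operators in common inference runtimes.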
