Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

Yuxin Li; Qiang Han; Mengying Yu; Yuxin Jiang; Chaikiat Yeo; Yiheng Li; Zihang Huang; Nini Liu; Hsuanhan Chen; Xiaojun Wu

TL;DR

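BEVENet is a convolutional-only, BEV-based 3D detection framework that runs three times faster than contemporary state-of-the-art approaches on nuScenes, reaching a mAP of 0.456 and an NDS of 0.555 at 47.6 frames per second. The authors present it as the first work to achieve efficiency gains of this size for BEV-based methods, which makes such methods more feasible for real-world autonomous driving.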

Abstract

3D object detection in Bird's-Eye-View (BEV) space has recently emerged as a prevalent approach in the field of autonomous driving. Despite the demonstrated improvements in accuracy and velocity estimation compared to perspective-view methods, deploying BEV-based techniques in real-world autonomous vehicles remains challenging. This is primarily due to their reliance on vision transformer (ViT)-based architectures, which introduce quadratic complexity with respect to the input resolution. To address this issue, we propose an efficient BEV-based 3D detection framework called BEVENet, which leverages a convolutional-only architectural design to circumvent the limitations of ViT models while maintaining the effectiveness of BEV-based methods. Our experiments show that BEVENet is 3$\times$ faster than contemporary state-of-the-art (SOTA) approaches on the nuScenes challenge, achieving a mean average precision (mAP) of 0.456 and a nuScenes detection score (NDS) of 0.555 on the nuScenes validation dataset, with an inference speed of 47.6 frames per second. To the best of our knowledge, this study is the first to achieve such significant efficiency improvements for BEV-based methods, highlighting their enhanced feasibility for real-world autonomous driving applications.
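To make the scaling argument in the abstract concrete, the following sketch compares how a rough cost proxy grows with input resolution for global self-attention and for a convolution. It is an illustration only and is not taken from the paper: the patch size of 16, the 3x3 kernel, the image sizes, and the function names are all assumptions, and the counts ignore channels, heads, and constant factors.

```python
def attention_pair_count(height: int, width: int, patch: int = 16) -> int:
    """Number of token pairs compared by global self-attention.

    Assumes non-overlapping square patches (hypothetical patch size).
    """
    tokens = (height // patch) * (width // patch)
    return tokens * tokens  # quadratic in the number of tokens


def conv_window_count(height: int, width: int, kernel: int = 3) -> int:
    """Number of kernel taps evaluated by a stride-1 convolution.

    Assumes 'same' padding, so there is one window per pixel.
    """
    return height * width * kernel * kernel  # linear in the number of pixels


# Doubling both image sides multiplies the pixel count by 4.
small = (256, 704)    # illustrative resolution, not from the paper
large = (512, 1408)

attention_growth = attention_pair_count(*large) / attention_pair_count(*small)
conv_growth = conv_window_count(*large) / conv_window_count(*small)
print(attention_growth, conv_growth)
```

Under these assumptions, quadrupling the pixel count multiplies the attention proxy by 16 but the convolution proxy only by 4, which is the gap a convolutional-only design aims to avoid.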

Paper Structure (22 sections, 2 equations, 6 figures, 3 tables)

Introduction
Related work
Backbone Models
3D Detection Methods
Methodology
Design Philosophy
Network Structure
Backbone Model
View Projection
Depth Prediction
Temporal Fusion and BEV Encoder
Detection Head
Experiments
Experiment Settings
Performance Benchmark
...and 7 more sections

Figures (6)

Figure 1: Comparison of inference speed across different state-of-the-art methods.
Figure 2: Overall Architecture of BEVENet. BEVENet consists of six major modules: Backbone, View Projector, Depth Estimator, Temporal Fuser, BEV Encoder and Detection Head. During the inference stage, only multi-view camera input is needed in the pipeline, whereas during training, LiDAR points are included as a rich source of supervision signals for the depth estimation module. Refer to Figure 4 for more details.
Figure 3: Illustration of View Projection. Camera images from the 2D domain are lifted to the 3D space along the light ray; projection is made in both the horizontal and vertical directions.
Figure 4: Illustration of the Depth Module. We adopt the same design as BEVDepth (Li et al., 2022) in the depth estimation module, but feed the augmentation matrix and extrinsic parameters, together with the intrinsic parameters, into the depth estimation network. The MLP layer is also replaced by a convolutional network.
Figure 5: Illustration of Detection Head Simplification by Re-Parameterization. Compared to the original detection head, we combine the output nodes mathematically by their values, which yields identical results with fewer multiplication operations.
...and 1 more figures
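
The lifting step described in the Figure 3 caption can be sketched with a standard pinhole camera model. This is a minimal illustration under stated assumptions, not BEVENet's implementation: the intrinsic matrix `K`, the pixel coordinates, the candidate depths, and the function name `lift_pixels_to_3d` are all hypothetical, and the sketch omits the camera-to-ego extrinsics and the pooling of lifted points into BEV cells.

```python
import numpy as np


def lift_pixels_to_3d(K: np.ndarray, us: np.ndarray, vs: np.ndarray,
                      depths: np.ndarray) -> np.ndarray:
    """Back-project pixels along their camera rays at each candidate depth.

    K:      (3, 3) pinhole intrinsic matrix (assumed, no lens distortion)
    us, vs: (N,) pixel columns and rows
    depths: (D,) candidate depths along the optical axis
    Returns an array of shape (D, 3, N) of points in the camera frame.
    """
    # Homogeneous pixel coordinates, one column per pixel: (3, N)
    pixels = np.stack([us, vs, np.ones_like(us)], axis=0).astype(float)
    # Ray directions normalised so that the z component equals 1
    rays = np.linalg.inv(K) @ pixels
    # Scale every ray by every candidate depth via broadcasting
    return depths[:, None, None] * rays[None, :, :]


# Hypothetical intrinsics: focal length 500 px, principal point (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
us = np.array([320, 820])   # the principal point and a pixel 500 px to its right
vs = np.array([240, 240])
depths = np.array([2.0, 10.0])

points = lift_pixels_to_3d(K, us, vs, depths)
print(points.shape)
```

Each pixel yields one 3D point per candidate depth, so the result has one axis for depths, one for the x, y, z coordinates, and one for pixels. A full pipeline would then transform these points into the ego frame and accumulate image features into the BEV grid.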
