Towards Efficient 3D Object Detection in Bird's-Eye-View Space for Autonomous Driving: A Convolutional-Only Approach

Yuxin Li, Qiang Han, Mengying Yu, Yuxin Jiang, Chaikiat Yeo, Yiheng Li, Zihang Huang, Nini Liu, Hsuanhan Chen, Xiaojun Wu

TL;DR

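BEVENet, a convolutional-only BEV detection framework, runs 3$\times$ faster than contemporary state-of-the-art approaches on the nuScenes challenge, at 47.6 frames per second. The authors report it as the first BEV-based method to achieve an efficiency improvement of this size, which strengthens the case for deploying BEV-based methods in real-world autonomous driving.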

Abstract

3D object detection in Bird's-Eye-View (BEV) space has recently emerged as a prevalent approach in the field of autonomous driving. Despite the demonstrated improvements in accuracy and velocity estimation compared to perspective-view methods, the deployment of BEV-based techniques in real-world autonomous vehicles remains challenging. This is primarily due to their reliance on vision transformer (ViT)-based architectures, which introduce quadratic complexity with respect to the input resolution. To address this issue, we propose an efficient BEV-based 3D detection framework called BEVENet, which leverages a convolutional-only architectural design to circumvent the limitations of ViT models while maintaining the effectiveness of BEV-based methods. Our experiments show that BEVENet is 3$\times$ faster than contemporary state-of-the-art (SOTA) approaches on the nuScenes challenge, achieving a mean average precision (mAP) of 0.456 and a nuScenes detection score (NDS) of 0.555 on the nuScenes validation dataset, with an inference speed of 47.6 frames per second. To the best of our knowledge, this study stands as the first to achieve such significant efficiency improvements for BEV-based methods, highlighting their enhanced feasibility for real-world autonomous driving applications.
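The scaling argument in the abstract can be made concrete with a rough count. The sketch below is illustrative only and is not taken from the paper: the helper names, the 16-pixel patch size, the 3x3 kernel, and the 256x704 base resolution are all assumptions. It compares how the number of token-to-token interactions in one global self-attention layer grows against the number of kernel taps in one stride-1 convolution when the input resolution doubles.

```python
def attention_pair_count(height: int, width: int, patch: int = 16) -> int:
    """Token-to-token interactions in one global self-attention layer.

    Assumes non-overlapping square patches (hypothetical patch size).
    """
    tokens = (height // patch) * (width // patch)
    return tokens * tokens  # every token attends to every token


def conv_tap_count(height: int, width: int, kernel: int = 3) -> int:
    """Kernel taps for one stride-1, same-padded convolution (per channel pair)."""
    return height * width * kernel * kernel


if __name__ == "__main__":
    base_h, base_w = 256, 704  # assumed base resolution, for illustration only
    for scale in (1, 2, 4):
        h, w = base_h * scale, base_w * scale
        print(f"{h}x{w}: attention pairs={attention_pair_count(h, w)}, "
              f"conv taps={conv_tap_count(h, w)}")
```

Doubling both image sides quadruples the pixel count, so the convolution count grows 4 times while the attention count grows 16 times. Channel widths, attention heads, and windowed-attention variants are deliberately left out of this count.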

Paper Structure

This paper contains 22 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison of inference speed across state-of-the-art methods.
  • Figure 2: Overall Architecture of BEVENet. BEVENet consists of six major modules: Backbone, View Projector, Depth Estimator, Temporal Fuser, BEV Encoder and Detection Head. During the inference stage, only multi-view camera input is needed in the pipeline, whereas during training, LiDAR points are included as a rich source of supervision signals for the depth estimation module. Refer to Figure 4 for more details.
  • Figure 3: Illustration of View Projection. Camera images from the 2D domain are lifted to the 3D space along the light ray; projection is made in both the horizontal and vertical directions.
  • Figure 4: Illustration of the Depth Module. We adopt the same design as BEVDepth (Li et al., 2022) in the depth estimation module, but feed the augmentation matrix and extrinsic parameters, together with the intrinsic parameters, into the depth estimation network. The MLP layer is also replaced by a convolutional network.
  • Figure 5: Illustration of Detection Head Simplification by Re-Parameterization. Compared with the original detection head, we merge the output nodes algebraically, which produces identical results with fewer multiplication operations.
  • ...and 1 more figure
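The Figure 5 caption describes merging output nodes so that the result is unchanged while fewer multiplications are needed. The exact merging scheme is not given in this excerpt, so the sketch below shows a generic form of re-parameterization as an assumption: two consecutive affine layers with no nonlinearity between them are folded into a single one, using W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). All function names and layer sizes are hypothetical.

```python
import random

Matrix = list[list[float]]
Vector = list[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def affine(w: Matrix, b: Vector, x: Vector) -> Vector:
    return [v + bias for v, bias in zip(matvec(w, x), b)]


def fold_linear(w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> tuple[Matrix, Vector]:
    """Fold y = W2 (W1 x + b1) + b2 into a single affine map y = W x + b."""
    w = matmul(w2, w1)
    b = [p + q for p, q in zip(matvec(w2, b1), b2)]
    return w, b


def mult_count(w: Matrix) -> int:
    """Multiplications needed to apply w to one input vector."""
    return len(w) * len(w[0])


def demo(n_in: int = 4, n_hidden: int = 8, n_out: int = 2) -> tuple[float, int, int]:
    """Return (max output difference, multiplications before, multiplications after)."""
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [rng.uniform(-1, 1) for _ in range(n_out)]
    x = [rng.uniform(-1, 1) for _ in range(n_in)]

    original = affine(w2, b2, affine(w1, b1, x))
    w, b = fold_linear(w1, b1, w2, b2)
    folded = affine(w, b, x)

    diff = max(abs(p - q) for p, q in zip(original, folded))
    return diff, mult_count(w1) + mult_count(w2), mult_count(w)


if __name__ == "__main__":
    print(demo())
```

The folded weights are computed once, offline, so inference pays only for the merged layer. The outputs agree up to floating-point rounding, and the saving depends on the layer sizes: folding helps when the intermediate width is larger than the input or output width, as in the assumed 4-8-2 example.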