BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

Zhenxin Li; Shiyi Lan; Jose M. Alvarez; Zuxuan Wu

TL;DR

This paper addresses the drawbacks of existing dense BEV-based 3D object detectors with three enhanced components: a CRF-modulated depth estimation module that enforces object-level consistency, a long-term temporal aggregation module with an extended receptive field, and a two-stage object decoder that combines perspective techniques with CRF-modulated depth embedding.

Abstract

The rise of query-based Transformer decoders is reshaping camera-based 3D object detection, and these decoders are surpassing traditional dense BEV (Bird's Eye View)-based methods. However, we argue that dense BEV frameworks remain important because of their outstanding abilities in depth estimation and object localization, which let them depict 3D scenes accurately and comprehensively. This paper addresses the drawbacks of existing dense BEV-based 3D object detectors with three enhanced components: a CRF-modulated depth estimation module that enforces object-level consistency, a long-term temporal aggregation module with an extended receptive field, and a two-stage object decoder that combines perspective techniques with CRF-modulated depth embedding. These enhancements lead to a "modernized" dense BEV framework dubbed BEVNeXt. On the nuScenes benchmark, BEVNeXt outperforms both BEV-based and query-based frameworks under various settings, achieving a state-of-the-art result of 64.2 NDS on the nuScenes test set. Code will be available at https://github.com/woxihuanjiangguo/BEVNeXt.
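The abstract reports its headline result in NDS (nuScenes Detection Score). As a reading aid, the sketch below computes NDS the way the nuScenes benchmark defines it: a weighted combination of mAP and the five mean true-positive errors (mATE, mASE, mAOE, mAVE, mAAE), each clipped to 1 and turned into a score. The function name `nds` and the input values are illustrative assumptions and are not results from the paper.

```python
def nds(m_ap, tp_errors):
    """nuScenes Detection Score from mAP and the five mean true-positive errors.

    m_ap is in [0, 1]; tp_errors maps each of mATE, mASE, mAOE, mAVE, mAAE
    to a non-negative error value.
    """
    expected = {"mATE", "mASE", "mAOE", "mAVE", "mAAE"}
    if set(tp_errors) != expected:
        raise ValueError("tp_errors must contain exactly: %s" % sorted(expected))
    # Each error is clipped to 1 and converted to a score in [0, 1].
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    # mAP carries weight 5; each of the five TP scores carries weight 1.
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Made-up inputs for illustration only (not BEVNeXt numbers).
example_errors = {"mATE": 0.5, "mASE": 0.25, "mAOE": 0.5, "mAVE": 0.25, "mAAE": 0.25}
example_nds = nds(0.5, example_errors)
print(example_nds)
```

Because mATE enters the score directly, a detector with fewer localization errors gains NDS even at equal mAP.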

Paper Structure (17 sections, 7 equations, 5 figures, 8 tables)

Introduction
Related Work
Dense BEV-based 3D Object Detection
Sparse Query-based 3D Object Detection
CRF for Dense Predictions
3D Object Detection with LiDAR sensors
Method
CRF-modulated Depth Estimation
Res2Fusion
Object Decoder with Perspective Refinement
Experiments
Implementation Details
Datasets and Metrics
Main Results
Ablation Studies
...and 2 more sections

Figures (5)

Figure 1: Previous SOTAs vs. BEVNeXt on the nuScenes 3D Object Detection Benchmark. On the nuScenes val and test splits, we compare BEVNeXt with previous SOTAs using ResNet-50 (bottom of the left panel), ResNet-101 (top of the left panel), and VoVNet-99 (right panel) as the backbone. BEVNeXt outperforms all previous sparse query-based methods in overall performance while producing far fewer localization errors. The diameter of each bubble represents the mean Average Translation Error (mATE) of the corresponding model. Higher and smaller bubbles are better. Best viewed in color.
Figure 2: Overall Architecture of BEVNeXt. The backbone first extracts multi-view image features, which are converted into depth distributions with a depth network and CRF modulation. The BEV feature at the current frame is fused with previous ones through a Res2Fusion module. Finally, a CenterPoint detection head, coupled with perspective refinement, generates object heatmaps and attributes.
Figure 3: Overview of Res2Fusion. We list three major types of BEV temporal fusion techniques: (a) parallel fusion, (b) recurrent fusion, and (c) Res2Fusion (used in BEVNeXt).
Figure 4: Comparison of Depth Estimation with and without CRF modulation on the nuScenes val split. We visualize depth ranges using an argmax operation on various depth bins. The CRF-modulated depth probabilities can distinguish objects from the background better.
Figure 5: Comparison of Detection Results with and without Perspective Refinement on the nuScenes val split. Compared with the coarse predictions of a CenterPoint head (Yin et al., 2021), our refined objects are better aligned with the ground truths.
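
The Figure 4 caption describes visualizing depth by taking an argmax over depth bins. A minimal sketch of that step follows, assuming uniformly spaced bins and a per-pixel probability volume indexed as `probs[bin][row][col]`. The helper names, the bin range, and the bin count are hypothetical and are not taken from the paper.

```python
def bin_centers(d_min, d_max, num_bins):
    """Centers of num_bins uniform depth bins covering [d_min, d_max] (assumed layout)."""
    width = (d_max - d_min) / num_bins
    return [d_min + (i + 0.5) * width for i in range(num_bins)]


def argmax_depth(probs, centers):
    """Map a depth probability volume probs[bin][row][col] to a depth map [row][col].

    Each pixel receives the center of its most probable depth bin.
    """
    num_bins = len(probs)
    height = len(probs[0])
    width = len(probs[0][0])
    depth_map = []
    for y in range(height):
        row = []
        for x in range(width):
            # Index of the bin with the highest probability at this pixel.
            best = max(range(num_bins), key=lambda d: probs[d][y][x])
            row.append(centers[best])
        depth_map.append(row)
    return depth_map


# Toy example: 4 bins over 0 to 8 metres, a 1x2 image.
centers = bin_centers(0.0, 8.0, 4)
probs = [
    [[0.1, 0.1]],  # bin 0
    [[0.6, 0.1]],  # bin 1
    [[0.2, 0.1]],  # bin 2
    [[0.1, 0.7]],  # bin 3
]
depth_map = argmax_depth(probs, centers)
print(depth_map)
```

An argmax map discards the shape of each per-pixel distribution, so it shows only the most likely depth range and not how confident the prediction is.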
