BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection

Zhenxin Li, Shiyi Lan, Jose M. Alvarez, Zuxuan Wu

TL;DR

This paper addresses the drawbacks of existing dense BEV-based 3D object detectors with three enhanced components: a CRF-modulated depth estimation module that enforces object-level consistency, a long-term temporal aggregation module with an extended receptive field, and a two-stage object decoder that combines perspective techniques with CRF-modulated depth embedding.

Abstract

Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However, we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization, depicting 3D scenes accurately and comprehensively. This paper aims to address the drawbacks of the existing dense BEV-based 3D object detectors by introducing our proposed enhanced components, including a CRF-modulated depth estimation module enforcing object-level consistencies, a long-term temporal aggregation module with extended receptive fields, and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding. These enhancements lead to a "modernized" dense BEV framework dubbed BEVNeXt. On the nuScenes benchmark, BEVNeXt outperforms both BEV-based and query-based frameworks under various settings, achieving a state-of-the-art result of 64.2 NDS on the nuScenes test set. Code will be available at https://github.com/woxihuanjiangguo/BEVNeXt.

Paper Structure (17 sections, 7 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Previous SOTAs vs. BEVNeXt on the nuScenes 3D Object Detection Benchmark. On the nuScenes val split and test split, we compare BEVNeXt with previous SOTAs using ResNet-50 (bottom of the left panel), ResNet-101 (top of the left panel), and VoVNet-99 (right panel) as the backbone. BEVNeXt outperforms all previous sparse query-based methods in overall performance while producing far fewer localization errors. The diameter of each bubble represents the mean Average Translation Error (mATE) each model produces. Higher and smaller bubbles are better. Best viewed in color.
  • Figure 2: Overall Architecture of BEVNeXt. The backbone first extracts multi-view image features, which are converted into depth distributions with a depth network and CRF modulation. The BEV feature at the current frame is fused with previous ones through a Res2Fusion module. Finally, a CenterPoint detection head, coupled with perspective refinement, generates object heatmaps and attributes.
  • Figure 3: Overview of Res2Fusion. We list three major types of BEV temporal fusion techniques: (a) parallel fusion, (b) recurrent fusion, and (c) Res2Fusion (used in BEVNeXt).
  • Figure 4: Comparison of Depth Estimation with and without CRF modulation on the nuScenes val split. We visualize depth ranges using an argmax operation on various depth bins. The CRF-modulated depth probabilities can distinguish objects from the background better.
  • Figure 5: Comparison of Detection Results with and without Perspective Refinement on the nuScenes val split. Compared with the coarse predictions of a CenterPoint head (Yin et al., 2021), our refined objects are more aligned with the ground truths.
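
The argmax visualization mentioned in the Figure 4 caption can be illustrated with a minimal sketch. It assumes uniformly spaced depth bins; the function names, `d_min`, `bin_size`, and the toy probabilities are hypothetical and are not taken from the paper or its code.

```python
from typing import List

def argmax_depth(probs: List[float], d_min: float, bin_size: float) -> float:
    """Return the center depth of the most probable bin for one pixel.

    Assumes uniform bins starting at d_min (hypothetical settings).
    """
    best = max(range(len(probs)), key=probs.__getitem__)
    return d_min + (best + 0.5) * bin_size

def depth_map(prob_volume: List[List[List[float]]],
              d_min: float = 1.0,
              bin_size: float = 0.5) -> List[List[float]]:
    """Collapse a per-pixel depth distribution into a single depth per pixel.

    prob_volume[row][col] holds that pixel's probabilities over depth bins.
    """
    return [[argmax_depth(p, d_min, bin_size) for p in row]
            for row in prob_volume]

# Toy 2x2 "image" with three depth bins per pixel.
volume = [
    [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],
]
depths = depth_map(volume)
```

Because argmax keeps only the most likely bin, a sharper distribution around an object gives a cleaner boundary in the resulting map, which is the kind of difference the Figure 4 comparison is meant to show.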