Robustness-Aware 3D Object Detection in Autonomous Driving: A Review and Outlook

Ziying Song, Lin Liu, Feiyang Jia, Yadan Luo, Guoxin Zhang, Lei Yang, Li Wang, Caiyan Jia

TL;DR

This review addresses robustness in 3D object detection for autonomous driving, emphasizing that practical safety requires maintaining performance under environmental variations, sensor noise, and calibration misalignment. It surveys camera-only, LiDAR-only, and multi-modal detectors, introducing a robustness-focused taxonomy and evaluating methods on corruption benchmarks such as KITTI-C and nuScenes-C to compare accuracy, latency, and robustness. The analysis shows multi-modal fusion generally offers superior robustness, while single-modality approaches are more vulnerable to noise and environmental changes, underscoring the need for robustness-aware design and evaluation. The paper aims to guide deployment and future research toward robustness-centric, real-world-ready perception systems for safe autonomous driving.

Abstract

In modern autonomous driving, the perception system is indispensable for accurately assessing the state of the surrounding environment, thereby enabling informed prediction and planning. A key component of this system is 3D object detection, which uses vehicle-mounted sensors such as LiDAR and cameras to identify the size, category, and location of nearby objects. Despite the surge in 3D object detection methods aimed at improving detection precision and efficiency, few studies systematically examine their resilience to environmental variations, noise, and weather changes. This study emphasizes the importance of robustness, alongside accuracy and latency, in evaluating perception systems under practical scenarios. Our work presents an extensive survey of camera-only, LiDAR-only, and multi-modal 3D object detection algorithms, evaluating their trade-offs among accuracy, latency, and robustness, particularly on datasets such as KITTI-C and nuScenes-C to ensure fair comparisons. Among these, multi-modal 3D detection approaches exhibit superior robustness. A novel taxonomy is also introduced to reorganize the literature for enhanced clarity. This survey aims to offer a more practical perspective on the current capabilities and constraints of 3D object detection algorithms in real-world applications, thus steering future research towards robustness-centric advancements.

Paper Structure

This paper contains 60 sections, 6 equations, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: An illustration of 3D object detection in autonomous driving scenarios with different sensors.
  • Figure 2: The general pipeline of Camera-only methods.
  • Figure 3: (a) The $AP_{\text{3D}}$ comparison of monocular-based methods [liu2020smoke, liu2022learning, qin2019monogrnet, chen2020monopair, monodistill, MonoDTR, zhang2022monodetr, lian2022monojsg, yang2023mix, shi2021geometry] and stereo-based methods [pon2020object, xu2020zoomnet, li2021rts3d, liu2021yolostereo3d, gao2022esgn, disprcnn, chen2022dmf, chen2022dsgn++, chen2020dsgn, garg2020cdn] on the KITTI test dataset.
    (b) The mAP (left) and NDS (right) comparison of monocular-based methods [Centernet, simonelli2019disentangling, wang2022probabilistic, park2021pseudo, wang2021fcos3d] and multi-view methods [wang2022detr3d, liu2022petr, liu2023petrv2, bevformer, bevformerv2, park2022time, bevdepth, lin2023sparse4dv3, wang2023exploring, jiang2023far3d] on the nuScenes test dataset.
    (c) The $AP_{\text{3D}}$ comparison of View-based methods [rangecd, RangeDet, rangeioudet, rangercnn], Voxel-based methods [Second, Pointpillars, part2, Voxelrcnn, PDV, voxeltransformer, PG-RCNN, TED], Point-based methods [PI-RCNN, Pointrcnn, 3DIoU-Net, Point-gnn, 3dssd, IA-SSD, sasa], and Point-Voxel-based methods [Std, lidarrcnn, HVPR, Pv-rcnn, pyramidrcnn] on the KITTI test dataset.
    (d) The mAP (left) and NDS (right) comparison of Voxel-based methods [Pointpillars, Centernet, Centerpoint, focalconv, UVTR, pillarnet, voxelnext, Transfusion, FocalFormer3D] and Point-based methods [3dssd] on the nuScenes test dataset.
    (e) The $AP_{\text{3D}}$ comparison of Point-Projection-based (P.P.) methods [Pointpainting, sindagi2019mvx, Epnet, epnetpami], Feature-Projection-based (F.P.) methods [mmf, focalconv, SupFusion], Auto-Projection-based (A.P.) methods [PI-RCNN, 3dcvf, HMFI, 3DDualFusion, Robust-FusionNet, graphalign, Logonet], Decision-Projection-based (D.P.) methods [mv3d, avod, Frustum-pointpillars, Frustumconvnet, roifusion, Fast-CLOCs, CLOCs], and Query-Learning-based (Q.L.) methods [CAT-Det] on the KITTI test dataset.
    (f) The mAP (left) and NDS (right) comparison of Point-Projection-based (P.P.) methods [wang2021pointaugmenting], Feature-Projection-based (F.P.) methods [LargeKernel3D], Auto-Projection-based (A.P.) methods [autoalignv2, graphalign], Query-Learning-based (Q.L.) methods [autoalign, Transfusion, DeepInteraction], and Unified-Feature-based (U.F.) methods [UVTR, FUTR3D, sparsefusion, BEVFusion, MSMDFusion, cmt, UniTR, cai2023bevfusion4d, FocalFormer3D, EA-BEV] on the nuScenes test dataset.
  • Figure 4: Corruption examples in the RoboBEV [RoboBEV] benchmark: simulating camera malfunction.
  • Figure 5: The general pipelines for LiDAR-only 3D object detection.
  • ...and 3 more figures