3D Object Detection for Autonomous Driving: A Survey

Rui Qian, Xin Lai, Xirong Li

TL;DR

This survey addresses the problem of 3D object detection for autonomous driving, examining how images, LiDAR, and their fusion can robustly infer oriented 3D bounding boxes and headings. It introduces a modality-based taxonomy (image-based, point-cloud-based, and multimodal fusion), differentiates fusion paradigms (sequential vs. parallel), and provides a comprehensive review of methods across voxel-based, point-based, and hybrid approaches, complemented by a 15-model case study with runtime, error, and robustness analyses. The work highlights that LiDAR-driven voxel/point methods currently offer the strongest accuracy and efficiency, while multimodal fusion adds robustness but requires careful alignment. It also points to future needs in uncertainty-aware perception, end-to-end depth learning, and shape-driven representations to advance safe, reliable autonomous driving systems.

Abstract

Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of the perception stack, especially for path planning, motion prediction, and collision avoidance. Taking a quick glance at the progress made so far, we attribute the challenges to visual appearance recovery in the absence of depth information from images, representation learning from partially occluded unstructured point clouds, and semantic alignment over heterogeneous features from different modalities. Despite existing efforts, 3D object detection for autonomous driving is still in its infancy. Recently, a large body of literature has addressed this 3D vision task. Nevertheless, few investigations have looked into collecting and structuring this growing knowledge. We therefore aim to fill this gap with a comprehensive survey, encompassing all the main concerns including sensors, datasets, performance metrics, and the recent state-of-the-art detection methods, together with their pros and cons. Furthermore, we provide quantitative comparisons with the state of the art. A case study on fifteen selected representative methods is presented, covering runtime analysis, error analysis, and robustness analysis. Finally, we provide concluding remarks after an in-depth analysis of the surveyed works and identify promising directions for future work.

Paper Structure

This paper contains 35 sections, 4 equations, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Levels of autonomous driving proposed by SAE (Society of Automotive Engineers) International [sae-level]. Where are we now?
  • Figure 2: An overview of the 3D object detection task from images and point clouds. Typical challenges: (a) Point Miss: LiDAR signals fail to return from the surface of objects. (b) External Occlusion: LiDAR signals are blocked by occluders in the vicinity. (c) Self Occlusion: the near side of an object blocks its far side, which makes point clouds 2.5D in practice. Note that bounding box prediction in (d) is much easier than in (e) due to the sparsity of point clouds at long range.
  • Figure 3: A summary showing how this survey differs from existing ones on 3D object detection. Vertically, the targeted scope determines where the boundary lies among these investigations. Horizontally, the hierarchical branches of this paper maintain good continuity with existing efforts [rahman2019recent, arnold2019survey] while adopting new branches (indicated in bold font) for dynamics, which importantly contributes to the maturity of the taxonomy on 3D object detection.
  • Figure 4: Comparison of 3D bounding box parameterizations: 8 corners proposed in [chen2017multi], 4 corners with heights proposed in [ku2018avod], the axis-aligned box encoding proposed in [dssa3d2016song], and the 7 parameters for an oriented 3D bounding box adopted in [yan2018second, zhou2018voxelnet, lang2019pointpillars, shi2019pointrcnn, Weng_2019_ICCV_Workshops].
  • Figure 5: General pipeline of 3D object detection. Image-based methods either lift estimated 2D results into 3D space via template matching or geometric constraints (a), or directly lift 2D image features into 3D space by computing a pseudo-LiDAR or learning a latent depth distribution (b). Point-cloud-based methods either voxelize an irregular point cloud into regular voxel grids and then learn feature representations explicitly (c), or leverage PointNet-like blocks or GNNs to learn permutation-invariant representations implicitly (d). Multimodal-fusion-based methods fuse the modalities at an early phase (e), a middle phase (f), or a late phase (g) of forward propagation.
  • ...and 3 more figures
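
The box encodings compared in the Figure 4 caption can be related to one another with a few lines of code. The following is a minimal sketch, not the survey's own implementation: it expands a 7-parameter oriented box into 8 corners and then derives an axis-aligned box from them. The function names (`box7_to_corners`, `corners_to_axis_aligned`) are hypothetical, and the convention is an assumption: (x, y, z) is the box center, l, w, h extend along the box's local x, y, z axes, and yaw is a counter-clockwise rotation about the vertical z axis. Datasets and papers differ on these conventions, so the formulas would need adjusting for any specific benchmark.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def box7_to_corners(x: float, y: float, z: float,
                    l: float, w: float, h: float,
                    yaw: float) -> List[Point]:
    """Expand a 7-parameter oriented box into its 8 corners.

    Assumed convention (not taken from the survey): (x, y, z) is the box
    center, l/w/h are the extents along the local x/y/z axes, and yaw is
    a counter-clockwise rotation about the z axis.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    # Footprint corners in the box's local frame, before rotation.
    footprint = ((l / 2, w / 2), (l / 2, -w / 2),
                 (-l / 2, -w / 2), (-l / 2, w / 2))
    corners = []
    for dz in (-h / 2, h / 2):          # bottom face first, then top face
        for dx, dy in footprint:
            # Rotate the local offset by yaw, then translate to the center.
            corners.append((x + c * dx - s * dy,
                            y + s * dx + c * dy,
                            z + dz))
    return corners


def corners_to_axis_aligned(corners: List[Point]) -> Tuple[Point, Point]:
    """Return (min_corner, max_corner) of the tightest axis-aligned box."""
    xs, ys, zs = zip(*corners)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


if __name__ == "__main__":
    # A 4 m x 2 m x 1.5 m box at the origin, rotated by 30 degrees.
    corners = box7_to_corners(0.0, 0.0, 0.0, 4.0, 2.0, 1.5, math.radians(30))
    for corner in corners:
        print(tuple(round(v, 3) for v in corner))
    print(corners_to_axis_aligned(corners))
```

The 7-parameter form is compact and keeps the box rectangular by construction, whereas the 8-corner form has 24 free values that need not describe a valid cuboid. The axis-aligned form discards heading entirely, which is why the envelope of a rotated box is larger than the box itself.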