Deep Multi-modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges

Di Feng, Christian Haase-Schütz, Lars Rosenbaum, Heinz Hertlein, Claudius Glaeser, Fabian Timm, Werner Wiesbeck, Klaus Dietmayer

TL;DR

The paper addresses robust, real-time perception for autonomous driving by surveying deep multi-modal object detection and semantic segmentation. It systematically analyzes sensing modalities, datasets, and fusion methodologies, focusing on what to fuse, how to fuse, and when to fuse, with emphasis on LiDAR-camera integration and the emerging role of Radar. Key contributions include a taxonomy of fusion approaches, a synthesis of datasets (2013–2019), and a discussion of open challenges such as data diversity, alignment, uncertainty modeling, and real-time performance, complemented by an interactive reference platform. Overall, the work guides researchers and practitioners toward more robust, sensor-aware multi-modal perception pipelines and highlights radar fusion and uncertainty-aware methods as promising directions for future work.

Abstract

Recent advancements in perception for autonomous driving are driven by deep learning. In order to achieve robust and accurate scene understanding, autonomous vehicles are usually equipped with different sensors (e.g. cameras, LiDARs, Radars), and multiple sensing modalities can be fused to exploit their complementary properties. In this context, many methods have been proposed for deep multi-modal perception problems. However, there is no general guideline for network architecture design, and questions of "what to fuse", "when to fuse", and "how to fuse" remain open. This review paper attempts to systematically summarize methodologies and discuss challenges for deep multi-modal object detection and semantic segmentation in autonomous driving. To this end, we first provide an overview of on-board sensors on test vehicles, open datasets, and background information for object detection and semantic segmentation in autonomous driving research. We then summarize the fusion methodologies and discuss challenges and open questions. In the appendix, we provide tables that summarize topics and methods. We also provide an interactive online platform to navigate each reference: https://boschresearch.github.io/multimodalperception/.

Paper Structure

This paper contains 57 sections, 6 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: A complex urban scenario for autonomous driving. The driverless car uses multi-modal signals for perception, such as RGB camera images, LiDAR points, Radar points, and map information. It needs to perceive all relevant traffic participants and objects accurately, robustly, and in real-time. For clarity, only the bounding boxes and classification scores for some objects are drawn in the image. The RGB image is adapted from neuhold2017mapillary.
  • Figure 2: Average precision (AP) vs. runtime. Visualized are deep learning approaches that use LiDAR, camera, or both as inputs for car detection on the KITTI bird's eye view test dataset. APs for the moderate difficulty level are reported. The results are mainly based on the KITTI leaderboard Geiger2012CVPR (visited on Apr. 20, 2019). Only published methods on the leaderboard are considered.
  • Figure 3: (a) The Boss autonomous car at DARPA 2007 urmson2008autonomous, (b) Waymo self-driving car waymo2017autonomous.
  • Figure 4: The Faster R-CNN object detection network. It consists of three parts: a pre-processing network to extract high-level image features, a Region Proposal Network (RPN) that produces region proposals, and a Faster R-CNN head that fine-tunes each region proposal.
  • Figure 5: (a) Normalized percentage of objects in the car, person, and cyclist classes in the KAIST Multispectral choi2018kaist, KITTI Geiger2012CVPR, Apolloscape apolloscape_arXiv_2018, and nuScenes nuscenes2019 datasets. For Apolloscape, E (easy), M (moderate), and H (hard) refer to the number of movable objects in the frame; details can be found in apolloscape_arXiv_2018. (b) Number of camera image frames in several datasets. Dataset size has increased by two orders of magnitude.
  • ...and 7 more figures