Point Cloud Based Scene Segmentation: A Survey

Dan Halperin, Niklas Eisl

TL;DR

This survey analyzes 3D point cloud semantic segmentation for autonomous driving, comparing projection-based, voxel-based, and hybrid approaches and their tradeoffs between accuracy and speed. It emphasizes that multi-representation fusion, such as combining voxel and point representations or range-view and bird's-eye-view (BEV) projections, yields superior performance, and it discusses the role of synthetic data (e.g., SynLiDAR) in mitigating the limited availability of real-world data. Real-world benchmarks (SemanticKITTI, nuScenes) and synthetic datasets, together with metrics such as mean IoU, are used to highlight current progress and remaining gaps. The findings suggest that projection-based methods are fast, whereas 3D-aware and hybrid architectures generally achieve higher segmentation quality. Future work should explore temporal information, scene completion, and domain adaptation to further close the gap to robust autonomous driving systems.
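Since mean IoU is the metric used to compare methods, a minimal sketch of how it is typically computed from per-point labels may help. The function name `mean_iou`, the `ignore_label` parameter, and the choice to skip classes that appear in neither the predictions nor the ground truth are assumptions made for illustration. Individual benchmarks may handle these cases differently.

```python
def mean_iou(pred, target, num_classes, ignore_label=None):
    """Mean intersection-over-union over classes for per-point labels.

    Assumes every prediction lies in [0, num_classes). Points whose
    ground-truth label equals `ignore_label` are skipped (an assumption;
    benchmarks define their own ignore rules).
    """
    tp = [0] * num_classes  # true positives per class
    fp = [0] * num_classes  # false positives per class
    fn = [0] * num_classes  # false negatives per class

    for p, t in zip(pred, target):
        if t == ignore_label:
            continue
        if p == t:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the point is not p
            fn[t] += 1  # the point is class t, but t was not predicted

    ious = []
    for c in range(num_classes):
        denom = tp[c] + fp[c] + fn[c]
        if denom > 0:  # skip classes absent from both pred and target
            ious.append(tp[c] / denom)

    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes and six points.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

In the toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.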

Abstract

Autonomous driving is a safety-critical application, and it is therefore a top priority that the accompanying assistance systems are able to provide precise information about the surrounding environment of the vehicle. Tasks such as 3D Object Detection deliver an insufficiently detailed understanding of the surrounding scene because they only predict a bounding box for foreground objects. In contrast, 3D Semantic Segmentation provides richer and denser information about the environment by assigning a label to each individual point, which is of paramount importance for autonomous driving tasks, such as navigation or lane changes. To inspire future research, in this review paper, we provide a comprehensive overview of the current state-of-the-art methods in the field of Point Cloud Semantic Segmentation for autonomous driving. We categorize the approaches into projection-based, 3D-based and hybrid methods. Moreover, we discuss the most important and commonly used datasets for this task and also emphasize the importance of synthetic data to support research when real-world data is limited. We further present the results of the different methods and compare them with respect to their segmentation accuracy and efficiency.

Paper Structure

This paper contains 26 sections, 4 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Pioneering works can be categorized as either projection-based, voxel-based, or point-based. However, recent approaches typically combine different representations.
  • Figure 2 [tang2020searching]: Smaller objects are no longer distinguishable at a low resolution. The left image shows a fine-grained 3D scene, while the right image has a coarse voxel resolution of $0.8\times 0.8\times 0.1$ m.
  • Figure 3 [zhang2020polarnet]: A single grid cell consists of $n$ 4-dimensional points (3D coordinates and intensity) that are each processed independently by a PointNet into $n$ 512-dimensional representations. A max-pooling operation follows, so that every grid cell has the same feature size of $1\times 512$.
  • Figure 4 [zhou2020cylinder3d]: The range-view projection maps the 3D points onto a 2D plane (panoramic view). Although the three red points have significantly different depth coordinates, they are all projected onto the same pixel, resulting in a substantial loss of information.
  • Figure 5 [tang2020searching]: The SPVC architecture consists of two branches: a voxel-based branch that allows for large receptive fields and a point-wise branch that preserves geometric information.
  • ...and 2 more figures
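
The information loss described in the Figure 4 caption can be reproduced with a small sketch of a spherical range-view projection. The function name `range_view_pixel`, the image size (64 x 2048), and the vertical field of view (+3° to -25°) are illustrative assumptions, not values taken from the survey. The mapping follows the commonly used yaw/pitch formulation.

```python
import math


def range_view_pixel(x, y, z, width=2048, height=64,
                     fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map a 3D point to (row, col, depth) in a spherical range image.

    Image size and field of view are assumed values for illustration.
    """
    depth = math.sqrt(x * x + y * y + z * z)
    yaw = math.atan2(y, x)          # horizontal angle
    pitch = math.asin(z / depth)    # vertical angle

    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down

    # Normalize the angles to [0, 1] and scale to image coordinates.
    col = 0.5 * (1.0 - yaw / math.pi) * width
    row = (1.0 - (pitch - fov_down) / fov) * height

    # Clamp to valid pixel indices.
    col = min(width - 1, max(0, int(math.floor(col))))
    row = min(height - 1, max(0, int(math.floor(row))))
    return row, col, depth


# Three points on the same ray from the sensor, at very different depths.
points = [(5.0, 1.0, -0.25), (20.0, 4.0, -1.0), (60.0, 12.0, -3.0)]
projected = [range_view_pixel(*p) for p in points]

for (row, col, depth) in projected:
    print(f"pixel=({row}, {col}) depth={depth:.2f} m")

# All three points share one pixel, so only one depth can be stored there.
pixels = {(row, col) for row, col, _ in projected}
print("distinct pixels:", len(pixels))
```

Because the three points lie on one ray, they share the same yaw and pitch and therefore the same pixel. A range image keeps a single value per pixel (typically the closest point), so the other two points are discarded.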