Enhancing 3D LiDAR Segmentation by Shaping Dense and Accurate 2D Semantic Predictions

Xiaoyu Dong, Tiankui Xian, Wanshui Gan, Naoto Yokoya

TL;DR

A multi-modal segmentation model, MM2D3D, is developed that enables intermediate 2D semantic predictions with dense distribution and higher accuracy, which effectively enhances the final 3D accuracy.

Abstract

Semantic segmentation of 3D LiDAR point clouds is important in urban remote sensing for understanding real-world street environments. This task, by projecting LiDAR point clouds and 3D semantic labels as sparse maps, can be reformulated as a 2D problem. However, the intrinsic sparsity of the projected LiDAR and label maps can result in sparse and inaccurate intermediate 2D semantic predictions, which in turn limits the final 3D accuracy. To address this issue, we enhance this task by shaping dense and accurate 2D predictions. Specifically, we develop a multi-modal segmentation model, MM2D3D. By leveraging camera images as auxiliary data, we introduce cross-modal guided filtering to overcome label map sparsity by constraining intermediate 2D semantic predictions with dense semantic relations derived from the camera images; and we introduce dynamic cross pseudo supervision to overcome LiDAR map sparsity by encouraging the 2D predictions to emulate the dense distribution of the semantic predictions from the camera images. Experiments show that our techniques enable our model to achieve intermediate 2D semantic predictions with dense distribution and higher accuracy, which effectively enhances the final 3D accuracy. Comparisons with previous methods demonstrate our superior performance in both 2D and 3D spaces.
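
To make the sparsity issue concrete, the following is a minimal sketch of how 3D points and their labels might be projected into a sparse 2D label map. It assumes a pinhole camera model with points already expressed in the camera frame; the function name `project_labels_to_sparse_map`, the `IGNORE_LABEL` value, and the intrinsic matrix `K` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

IGNORE_LABEL = 255  # hypothetical marker for pixels that receive no LiDAR point


def project_labels_to_sparse_map(points_cam, labels, K, height, width):
    """Project labeled 3D points (camera frame, shape (N, 3)) into a sparse
    (height, width) label map, assuming a pinhole camera with intrinsics K."""
    label_map = np.full((height, width), IGNORE_LABEL, dtype=np.uint8)
    depth_map = np.full((height, width), np.inf, dtype=np.float64)

    # Keep only points in front of the camera.
    in_front = points_cam[:, 2] > 0
    pts, lbl = points_cam[in_front], labels[in_front]

    # Pinhole projection: [u, v, 1]^T is proportional to K @ [x, y, z]^T.
    uvw = pts @ K.T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Discard points that fall outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, lbl = u[inside], v[inside], pts[inside, 2], lbl[inside]

    # Write far points first so the nearest point wins at each pixel
    # (with repeated indices, NumPy keeps the last assigned value).
    order = np.argsort(-z)
    label_map[v[order], u[order]] = lbl[order]
    depth_map[v[order], u[order]] = z[order]
    return label_map, depth_map
```

Because a LiDAR sweep contains far fewer points than an image has pixels, most entries of the resulting map remain `IGNORE_LABEL`, which is the label map sparsity the abstract refers to.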

Paper Structure (18 sections, 9 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: (a) Intermediate 2D semantic predictions. Inputs are the LiDAR map and the camera image. P3D Label denotes the label map (shown overlaid on the camera image), which is used as supervision for training. 2D Label is used only for evaluation. (b) Comparison of 2D and 3D accuracy on our nuScenes2D3D test set. The 2D accuracy of RangeViT-CS [cvpr23_rangevit], RangeViT-IN21k [cvpr23_rangevit], and EPMF [pami24_epmf] is not available because their intermediate 2D predictions and the 2D labels are from different projection views or are misaligned.
  • Figure 2: An illustration of our MM2D3D model. The cross-modal guided filtering constrains intermediate 2D semantic predictions with dense semantic relations derived from camera images to increase accuracy in unlabeled regions. The dynamic cross pseudo supervision encourages the intermediate 2D semantic predictions to emulate the dense distribution of camera semantic predictions.
  • Figure 3: (a) Our cross-modal guided filtering. (b) Tree affinity generation: In the 4-connected planar graph, each vertex has four neighbors. The minimum spanning tree includes all vertices in the graph, ensuring connectivity. We generate the affinity matrix based on the distances between vertices in the tree.
  • Figure 4: Effect of our cross-modal guided filtering and dynamic cross pseudo supervision on intermediate 2D semantic predictions. Only sparse label maps are used for training.
  • Figure 5: Intermediate 2D semantic predictions from models employing different learning schemes in cross-modal guided filtering. Only sparse label maps are used for training.
  • ...and 3 more figures
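
The tree affinity generation described in the Figure 3(b) caption can be sketched roughly as follows: build a 4-connected grid graph over a small feature map, extract a minimum spanning tree, and turn tree distances into affinities. This is only an illustration under stated assumptions. The edge weight (absolute feature difference), the exponential mapping `exp(-distance / sigma)`, and all function names are hypothetical choices, not the authors' implementation.

```python
import numpy as np
from collections import deque


def grid_edges(feat):
    """4-connected edges of an (H, W) map as (weight, a, b) tuples.
    Assumption: weight is the absolute feature difference."""
    h, w = feat.shape
    edges = []
    for r in range(h):
        for c in range(w):
            a = r * w + c
            if c + 1 < w:  # right neighbor
                edges.append((abs(float(feat[r, c] - feat[r, c + 1])), a, a + 1))
            if r + 1 < h:  # bottom neighbor
                edges.append((abs(float(feat[r, c] - feat[r + 1, c])), a, a + w))
    return edges


def minimum_spanning_tree(num_vertices, edges):
    """Kruskal's algorithm with union-find; returns the tree as an adjacency list."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = [[] for _ in range(num_vertices)]
    for wgt, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra != rb:  # the edge joins two components, so keep it
            parent[ra] = rb
            tree[a].append((b, wgt))
            tree[b].append((a, wgt))
    return tree


def tree_affinity(feat, sigma=1.0):
    """Affinity matrix from pairwise path distances in the spanning tree."""
    h, w = feat.shape
    n = h * w
    tree = minimum_spanning_tree(n, grid_edges(feat))
    dist = np.zeros((n, n))
    for src in range(n):
        # Paths in a tree are unique, so one traversal gives exact distances.
        seen = [False] * n
        seen[src] = True
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt, wgt in tree[cur]:
                if not seen[nxt]:
                    seen[nxt] = True
                    dist[src, nxt] = dist[src, cur] + wgt
                    queue.append(nxt)
    # Assumption: affinity decays exponentially with tree distance.
    return np.exp(-dist / sigma)
```

The all-pairs traversal costs O(N^2) and is only practical for tiny maps; it is meant to show the idea that vertices connected through low-weight tree paths receive high affinity, while vertices separated by a strong feature edge receive low affinity.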