EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation

Mingkui Tan, Zhuangwei Zhuang, Sitao Chen, Rong Li, Kui Jia, Qicheng Wang, Yuanqing Li

TL;DR

This paper tackles 3D semantic segmentation by addressing the modality gap between RGB appearance and LiDAR depth. It introduces PMF, a perception-aware fusion framework that projects LiDAR into camera coordinates and uses a two-stream network with residual fusion and perception-aware losses to jointly leverage appearance and depth. An enhanced version, EPMF, optimizes data pre-processing and network architecture under perspective projection (including cross-modal alignment, cropping, and efficient contextual modules) to boost efficiency and accuracy. Extensive experiments on SemanticKITTI-FV, nuScenes, and A2D2 show that EPMF achieves state-of-the-art or competitive performance across datasets and distances, with notable improvements in mIoU over LiDAR-only and existing fusion methods, while maintaining practical inference speeds.

Abstract

We study multi-sensor fusion for 3D semantic segmentation, which is important to scene understanding in many applications, such as autonomous driving and robotics. Existing fusion-based methods, however, may not achieve promising performance due to the vast difference between the two modalities. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to effectively exploit perceptual information from two modalities, namely, appearance information from RGB images and spatio-depth information from point clouds. To this end, we project point clouds to the camera coordinate system using perspective projection, and process both inputs from LiDAR and cameras in 2D space while preventing the information loss of RGB images. Then, we propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules. Moreover, we introduce additional perception-aware losses to measure the perceptual difference between the two modalities. Last, we propose an improved version of PMF, i.e., EPMF, which is more efficient and effective by optimizing data pre-processing and network architecture under perspective projection. Specifically, we propose cross-modal alignment and cropping to obtain tight inputs and reduce unnecessary computational costs. We then explore more efficient contextual modules under perspective projection and fuse the LiDAR features into the camera stream to boost the performance of the two-stream network. Extensive experiments on benchmark data sets show the superiority of our method. For example, on the nuScenes test set, EPMF outperforms the state-of-the-art method, i.e., RangeFormer, by 0.9% in mIoU. Our source code is available at https://github.com/ICEORY/PMF.
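
To make the perspective-projection step in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `project_lidar_to_image`, the extrinsic matrix `T_cam_lidar`, the intrinsic matrix `K`, and the choice to keep only a single depth channel (nearest point per pixel) are all assumptions made for illustration.

```python
import numpy as np

def project_lidar_to_image(points, T_cam_lidar, K, height, width):
    """Project LiDAR points into a sparse depth image (hypothetical helper).

    points:       (N, 3) xyz coordinates in the LiDAR frame
    T_cam_lidar:  (4, 4) assumed extrinsic transform, LiDAR -> camera
    K:            (3, 3) assumed camera intrinsic matrix
    Returns (depth_img, pixels, idx): an (H, W) depth image with 0 where no
    point lands, (M, 2) integer (row, col) pixels, and (M,) indices of the
    points that fall inside the image.
    """
    n = points.shape[0]
    homo = np.hstack([points, np.ones((n, 1))])      # homogeneous coords, (N, 4)
    cam = (T_cam_lidar @ homo.T).T[:, :3]            # points in the camera frame

    # Keep only points in front of the camera (positive depth).
    in_front = cam[:, 2] > 1e-6
    idx = np.nonzero(in_front)[0]
    cam = cam[in_front]

    # Perspective division: pixel = (K @ p) / z.
    uvw = (K @ cam.T).T
    cols = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    rows = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)

    # Discard points that project outside the image bounds.
    inside = (cols >= 0) & (cols < width) & (rows >= 0) & (rows < height)
    rows, cols, idx, depth = rows[inside], cols[inside], idx[inside], cam[inside, 2]

    # When several points hit one pixel, keep the nearest depth.
    depth_img = np.full((height, width), np.inf)
    np.minimum.at(depth_img, (rows, cols), depth)
    depth_img[np.isinf(depth_img)] = 0.0

    return depth_img, np.stack([rows, cols], axis=1), idx
```

The returned `pixels` and `idx` arrays record which pixel each surviving point landed on, so 2D predictions can later be read back at those pixels to label the original 3D points.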

Paper Structure

This paper contains 27 sections, 20 equations, 12 figures, 14 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparisons of spherical projection (Milioto et al., 2019; Wu et al., 2018) and perspective projection. With spherical projection, most of the appearance information from RGB images is lost. Instead, we preserve the information of images with perspective projection. To distinguish different classes, we colorize the point clouds using semantic labels from SemanticKITTI.
  • Figure 2: Comparisons of the predictions from images and point clouds. Deep neural networks capture different perceptual information from RGB images and point clouds. Red indicates predictions with higher scores.
  • Figure 3: Comparisons of efficiency and performance of different methods on SemanticKITTI-FV.
  • Figure 4: Illustration of the training and inference schemes of EPMF. EPMF consists of three components: (1) perspective projection with cross-modal alignment and cropping; (2) a two-stream network (TSNet) with feature fusion modules; and (3) perception-aware losses ${\mathcal{L}}_{per},\widetilde{{\mathcal{L}}}_{per}$ w.r.t. the camera stream and the LiDAR stream. We first project the point clouds to the camera coordinate system with perspective projection and learn the features from both the RGB images and point clouds using TSNet. The image features are fused into the LiDAR stream network by fusion modules. In the training procedure, we use perception-aware losses to help the network focus on the perceptual features of both images and point clouds. In the inference procedure, we apply dense-to-sparse mapping to obtain 3D segmentation results of point clouds.
  • Figure 5: Illustration of the residual-based fusion (RF) module. RF fuses features from both the camera and LiDAR to generate the complementary information of the original LiDAR features.
  • ...and 7 more figures
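
The residual-based fusion idea in the Figure 5 caption (camera and LiDAR features are combined into complementary information that is added back to the original LiDAR features) can be illustrated with a simplified NumPy sketch. This is an assumption-laden toy, not the paper's RF module: the function name `residual_fusion`, the single 1x1 convolution with ReLU, and the `weight` and `bias` parameters are hypothetical, and any extra components the actual module may use are omitted.

```python
import numpy as np

def residual_fusion(lidar_feat, cam_feat, weight, bias):
    """Simplified residual-style fusion (illustrative only).

    lidar_feat: (C_l, H, W) LiDAR-stream features
    cam_feat:   (C_c, H, W) camera-stream features on the same pixel grid
    weight:     (C_l, C_l + C_c) weights of an assumed 1x1 convolution
    bias:       (C_l,) bias of that convolution
    """
    # Stack both modalities along the channel axis.
    stacked = np.concatenate([lidar_feat, cam_feat], axis=0)

    # A 1x1 convolution is a per-pixel linear map over channels.
    fused = np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]
    fused = np.maximum(fused, 0.0)  # ReLU

    # Residual connection: the fused signal only adds to the LiDAR
    # features, so the original LiDAR information is preserved.
    return lidar_feat + fused
```

With all-zero `weight` and `bias` the function returns the LiDAR features unchanged, which is the property the residual connection is meant to provide: the fused branch can only add complementary information on top of the original features.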