Depth-Aware Range Image-Based Model for Point Cloud Segmentation

Bike Chen, Antti Tikanmäki, Juha Röning

TL;DR

The paper targets range image-based point cloud segmentation (PCS) and exploits the implicit but ordered depth information that backbones derived from color image-based models overlook. It introduces the Depth-Aware Module (DAM), which combines global context from global average pooling (GAP) with a sinusoidal positional encoding to produce depth-aware channel scales, and integrates DAM into the last block of each stage of Fast FMVNet V3. In experiments on SemanticKITTI, nuScenes, and SemanticPOSS, the approach reaches strong mIoU scores with a favorable speed-accuracy trade-off (e.g., an mIoU of up to 69.6% on SemanticKITTI at 25.5 FPS), and DAM is shown to generalize to other range image-based models. The work suggests that depth-aware channel recalibration is a practical way to improve real-time PCS for outdoor robotics and opens avenues for range image-based semantic SLAM.

Abstract

Point cloud segmentation (PCS) aims to separate points into different and meaningful groups. The task plays an important role in robotics because PCS enables robots to understand their physical environments directly. To process sparse and large-scale outdoor point clouds in real time, range image-based models are commonly adopted. However, in a range image, the lack of explicit depth information inevitably causes some separate objects in 3D space to touch each other, making it difficult for range image-based models to segment the objects correctly. Moreover, previous PCS models are usually derived from existing color image-based models and are unable to make full use of the implicit but ordered depth information inherent in the range image, thereby achieving inferior performance. In this paper, we propose the Depth-Aware Module (DAM) and Fast FMVNet V3. DAM perceives the ordered depth information in the range image by explicitly modelling the interdependence among channels. Fast FMVNet V3 incorporates DAM by integrating it into the last block in each architecture stage. Extensive experiments conducted on SemanticKITTI, nuScenes, and SemanticPOSS demonstrate that DAM brings a significant improvement for Fast FMVNet V3 with negligible computational cost.
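
To illustrate why separate objects can touch in a range image, the following is a minimal sketch of the commonly used spherical projection. It is not the scan unfolding++ projection that Fast FMVNet V3 uses. The function name `spherical_projection`, the 64 x 2048 resolution, and the +3° / -25° vertical field of view are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def spherical_projection(points, height=64, width=2048,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (N, 3) LiDAR points onto a (height, width) range image.

    Resolution and field of view are illustrative assumptions.
    Pixels that receive no point keep the value -1.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)        # distance to the sensor
    yaw = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / depth)                  # elevation angle

    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down

    # Map angles to fractional pixel coordinates, then to integer indices.
    cols = 0.5 * (1.0 - yaw / np.pi) * width
    rows = (1.0 - (pitch - fov_down) / fov) * height
    cols = np.clip(np.floor(cols), 0, width - 1).astype(np.int64)
    rows = np.clip(np.floor(rows), 0, height - 1).astype(np.int64)

    image = np.full((height, width), -1.0, dtype=np.float32)
    image[rows, cols] = depth                     # depth is the pixel value
    return image, rows, cols

# Two points about 35 m apart in 3D: one on a near object, one on a far object.
points = np.array([[5.0, 0.0, 0.0],
                   [40.0, -0.15, 0.0]])
image, rows, cols = spherical_projection(points)
gap_3d = np.linalg.norm(points[0] - points[1])
print("pixels:", list(zip(rows.tolist(), cols.tolist())), "3D gap:", gap_3d)
```

In this toy case the two points land in horizontally adjacent pixels of the same row even though they are far apart in 3D. The only trace of their separation is the difference between the two stored depth values, which is the implicit depth information the abstract refers to.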

Paper Structure

This paper contains 29 sections, 7 equations, 14 figures, 10 tables, 1 algorithm.

Figures (14)

  • Figure 1: (a) In a point cloud, objects A and B remain separate along the depth direction after all points are voxelized [minkowski2019, spvnas_2020]. (b) After the points are serialized by space-filling curves such as Z-order [pointtransv32024], objects A and B also remain naturally separate in 3D space. We consider that the voxel and point representations have explicit depth information. (c) However, object A touches object B in the range image although they are far apart in 3D space. We think that the range image contains implicit depth information.
  • Figure 2: (a) A color image [kitti12] and the pixels visualized in 3D space where RGB values serve as the coordinates and colors. (b) A gray image and the pixels visualized in 3D space where the pixel values are transformed to the coordinates by a spherical model and the above RGB values act as the colors. In the color and gray images, the pixel values vary without any order. Also, we cannot find any meaningful objects in the above "pixel clouds". (c) The range image and corresponding point cloud. The depth value in the range image varies with the distance of the target object from the LiDAR sensor. The objects in the point cloud are meaningful. We think that the depth values are ordered.
  • Figure 3: Overview of the proposed depth-aware module (DAM). First, feature maps $\boldsymbol{M}$ go through a global average pooling (GAP) layer to output the vector $\boldsymbol{g}$. The sinusoidal positional encoding (SPE) generates the vector $\boldsymbol{z}$. Then, both $\boldsymbol{g}$ and $\boldsymbol{z}$ pass through a shared multi-layer perceptron (MLP). The outputs from $\boldsymbol{g}$ and $\boldsymbol{z}$ are summed and go through a Sigmoid function to produce a scale $\boldsymbol{s}$. Finally, the scale $\boldsymbol{s}$ is multiplied by the feature maps $\boldsymbol{M}$ to make the depth-aware feature maps $\boldsymbol{M}^{\prime}$.
  • Figure 4: Left: visualization of the sinusoidal positional encoding [attention2017]; Right: sinusoids formed by three groups of position values obtained under the dimensions of 0, 10, and 20.
  • Figure 5: (a) Overview of the introduced Fast FMVNet V3. A point cloud is first projected onto the range image by scan unfolding++ [filling_missing2024]. Then, the range image goes through the backbone of Fast FMVNet V3, which contains four Stages, to output feature maps. Finally, the feature maps pass through the decoder part UPer Head [upernet2018] and the post-processing module PDM [pdm2024] to output pointwise predictions. In Fast FMVNet V3, the last ConvNeXt Block in each Stage is the depth-aware ConvNeXt Block (DConvNeXt Block) incorporating the proposed depth-aware module (DAM).
  • ...and 9 more figures
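
The data flow described in the Figure 3 caption (GAP, SPE, shared MLP, sum, Sigmoid, channel-wise scaling) can be sketched roughly as follows. This is a NumPy illustration under stated assumptions, not the authors' implementation. The caption does not say how $\boldsymbol{z}$ is derived from the encoding, so the sketch simply encodes one hypothetical scalar `position`. The two-layer ReLU MLP with a channel reduction of 2, the random weights, and all function names are likewise assumptions.

```python
import numpy as np

def sinusoidal_encoding(position, dim, base=10000.0):
    """Transformer-style encoding of one scalar position into a (dim,) vector.

    Even indices hold sines and odd indices hold cosines.
    """
    i = np.arange(dim)
    angle = position / np.power(base, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def shared_mlp(v, w1, w2):
    """Assumed two-layer MLP with ReLU; the same weights serve g and z."""
    return np.maximum(v @ w1, 0.0) @ w2

def depth_aware_module(feature_maps, z, w1, w2):
    """Rescale (C, H, W) feature maps M with a per-channel scale s."""
    g = feature_maps.mean(axis=(1, 2))            # GAP: (C, H, W) -> (C,)
    s = sigmoid(shared_mlp(g, w1, w2) + shared_mlp(z, w1, w2))
    return feature_maps * s[:, None, None], s     # M' = s * M, broadcast over H, W

# Toy usage with random features and weights (illustrative values only).
rng = np.random.default_rng(0)
C, H, W = 8, 4, 16
M = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C, C // 2)) * 0.1       # hypothetical reduction to C/2
w2 = rng.standard_normal((C // 2, C)) * 0.1
z = sinusoidal_encoding(position=3, dim=C)        # hypothetical position index
M_prime, s = depth_aware_module(M, z, w1, w2)
print(M_prime.shape, s.shape)
```

Because the scale has one value per channel, the added cost is a pooling step and two small matrix products per call, independent of the spatial resolution. That is consistent with the abstract's claim of negligible computational cost, although the exact layer sizes in the paper may differ.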