Trainable Pointwise Decoder Module for Point Cloud Segmentation

Bike Chen, Chen Gong, Antti Tikanmäki, Juha Röning

TL;DR

This work tackles information loss and mislabeling in range image-based point cloud segmentation by introducing a trainable Pointwise Decoder Module (PDM) that combines a range image-guided $K$NN search with a lightweight local feature extractor, enabling end-to-end training with existing backbones. To address augmentation-induced artifacts and uncontrolled point growth, the authors propose Virtual Range Image-guided Copy-rotate-paste (VRCrop), which limits the point count and preserves semantics during training. Across SemanticKITTI, SemanticPOSS, and nuScenes, PDM-enabled models consistently surpass traditional post-processing methods (KNN, NLA, KPConv) in mIoU while maintaining favorable inference speed. Together, PDM and VRCrop advance range image-based PCS by enabling robust per-point predictions in varying-density outdoor scenes, offering a practical path toward improved robotic perception systems. Future work could explore integrating clustering-based strategies to further mitigate artifacts behind pasted instances.

Abstract

Point cloud segmentation (PCS) aims to make per-point predictions and enables robots and autonomous driving cars to understand the environment. The range image is a dense representation of a large-scale outdoor point cloud, and segmentation models built upon the image typically run efficiently. However, projecting the point cloud onto the range image inevitably drops points because, at each image coordinate, only one point is kept even when multiple points are projected onto the same location. More importantly, it is challenging to assign correct predictions to dropped points whose class differs from that of the kept point. Besides, existing post-processing methods, such as K-nearest neighbor (KNN) search and kernel point convolution (KPConv), either cannot be trained with the models in an end-to-end manner or cannot process varying-density outdoor point clouds well, which leaves the models with sub-optimal performance. To alleviate this problem, we propose a trainable pointwise decoder module (PDM) as the post-processing approach, which gathers weighted features from the neighbors and then makes the final prediction for the query point. In addition, we introduce a virtual range image-guided copy-rotate-paste (VRCrop) strategy in data augmentation. VRCrop constrains the total number of points and eliminates undesirable artifacts in the augmented point cloud. With PDM and VRCrop, existing range image-based segmentation models consistently perform better than their counterparts on the SemanticKITTI, SemanticPOSS, and nuScenes datasets.
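
The point-dropping effect described in the abstract can be illustrated with a minimal sketch. It assumes the common spherical projection, a hypothetical 64×2048 image, and a vertical field of view from +3° to -25°; the function name `project_to_range_image` and all parameter values are illustrative assumptions, not the projection used in the paper.

```python
import math


def project_to_range_image(points, height=64, width=2048,
                           fov_up_deg=3.0, fov_down_deg=-25.0):
    """Spherically project (x, y, z) points; keep the nearest point per pixel.

    Returns (kept, dropped): `kept` maps a pixel (v, u) to the index of the
    point stored there, `dropped` lists indices of points that lost their pixel.
    All defaults are assumed values for illustration only.
    """
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    best = {}      # (v, u) -> (range, point index)
    dropped = []
    for idx, (x, y, z) in enumerate(points):
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            dropped.append(idx)  # a point at the sensor origin has no direction
            continue
        yaw = math.atan2(y, x)
        pitch = math.asin(z / r)
        # Discretization: many continuous directions share one pixel.
        u = int(0.5 * (1.0 - yaw / math.pi) * width)
        v = int((1.0 - (pitch - fov_down) / fov) * height)
        u = min(max(u, 0), width - 1)
        v = min(max(v, 0), height - 1)
        pixel = (v, u)
        if pixel in best and best[pixel][0] <= r:
            dropped.append(idx)                  # farther point is discarded
        else:
            if pixel in best:
                dropped.append(best[pixel][1])   # previous occupant is replaced
            best[pixel] = (r, idx)
    kept = {pixel: idx for pixel, (_, idx) in best.items()}
    return kept, dropped


# Points 0 and 1 lie on the same ray, so they compete for one pixel.
cloud = [(10.0, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 10.0, 0.0)]
kept, dropped = project_to_range_image(cloud)
print(len(kept), dropped)
```

In this toy cloud the second point is dropped; if it belonged to a different class than the first, a per-pixel prediction projected back to the points would mislabel it, which is the failure case the abstract describes.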

Paper Structure (30 sections, 1 equation, 11 figures, 12 tables, 1 algorithm)

Figures (11)

  • Figure 1: (a) When points are projected onto the range image, several points may land on the same location because of discretization. For example, points A and B are mapped to the same position (i.e., the blue grid cell) in the range image. (b) Correspondingly, when per-pixel predictions are projected back onto the points, points belonging to different classes may be assigned the same label. For instance, points A and B belong to different classes but receive the same prediction from the range image result. This leads to inferior per-point segmentation performance. (c) Theoretical upper bounds of mIoU (%) scores under various range image sizes on the SemanticKITTI [semantickitti_2019_behley], SemanticPOSS [semanticposs_2020], and nuScenes [nuscenes_panoptic] datasets. Specifically, with sizes of $64\times2048$, $40\times1800$, and $32\times1088$, a range image-based model achieves at most 97.96%, 99.13%, and 91.53% mIoU on the three datasets, respectively.
  • Figure 2: Overview of the framework. Scan unfolding++ [filling_missing2024] is used to project points onto the range image. Then, the range image goes through a range image-based model to produce the feature map and make per-pixel predictions. Finally, the proposed trainable pointwise decoder module (PDM) is utilized to make per-point predictions.
  • Figure 3: Range image-guided $K$NN search in PDM. Using the look-up table (LUT), a point $\boldsymbol{p}_i$ first finds its $\left(u, v\right)$ coordinate in "Range Image" and "Feature Map" (indicated by the dashed red circles). Then, the point can efficiently search the neighbor points $\left\{\boldsymbol{p}_1, \boldsymbol{p}_2, \dots, \boldsymbol{p}_9\right\}$ and corresponding features $\left\{\boldsymbol{f}_1, \boldsymbol{f}_2, \dots, \boldsymbol{f}_9\right\}$. Next, "Top-$K$ Indices" are calculated from the "Relative Distances". Finally, the indices are used to select the $K$ nearest neighbor points $\left\{\boldsymbol{p}_1, \boldsymbol{p}_2, \dots, \boldsymbol{p}_5\right\}$ and corresponding feature vectors $\left\{\boldsymbol{f}_1, \boldsymbol{f}_2, \dots, \boldsymbol{f}_5\right\}$.
  • Figure 4: Local feature extraction module in PDM. "Relative Positions" $\left\{\triangle\boldsymbol{p}_{i1}, \triangle\boldsymbol{p}_{i2}, \dots, \triangle\boldsymbol{p}_{i5}\right\}$ and "Edge Features" $\left\{\triangle\boldsymbol{f}_{i1}, \triangle\boldsymbol{f}_{i2}, \dots, \triangle\boldsymbol{f}_{i5}\right\}$ are used to calculate attentive weights $\left\{\boldsymbol{w}_1, \boldsymbol{w}_2, \dots, \boldsymbol{w}_5\right\}$. "Feature Vectors" $\left\{\boldsymbol{f}_1, \boldsymbol{f}_2, \dots, \boldsymbol{f}_5\right\}$ and "Relative Positions" are fused to generate the features $\left\{\boldsymbol{\delta}_1, \boldsymbol{\delta}_2, \dots, \boldsymbol{\delta}_5\right\}$. Finally, the summation (SUM) operation is adopted to aggregate the weighted features, and a classifier is utilized to make per-point predictions.
  • Figure 5: Projection of points onto the virtual range image. The points from the same laser are sequentially projected onto the same scan line. If two or more projected points occupy one virtual range image position, all points are kept. No points are dropped.
  • ...and 6 more figures
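
The neighbor search and weighted aggregation outlined in the captions of Figures 3 and 4 can be sketched as follows. This is a rough illustration under stated assumptions: the range image is a nested list holding one kept point (or `None`) per pixel, the search window is 3×3 with $K=5$ as in the captions, and the azimuth axis is assumed to wrap around. The function names `range_image_knn` and `aggregate_features` are hypothetical. Most importantly, the softmax over negative distances is a fixed stand-in for the attentive weights, which in PDM are learned from relative positions and edge features.

```python
import math


def range_image_knn(query, pixel, points_img, k=5, window=3):
    """Find up to k nearest kept points inside a window around `pixel`.

    query      : (x, y, z) of the query point (it may be a dropped point).
    pixel      : (v, u) coordinate of the query, e.g. read from a look-up table.
    points_img : nested list [v][u] of (x, y, z) tuples, or None for empty pixels.
    Returns a list of (distance, (v, u)) pairs sorted by increasing distance.
    """
    height, width = len(points_img), len(points_img[0])
    v0, u0 = pixel
    half = window // 2
    candidates = []
    for dv in range(-half, half + 1):
        v = v0 + dv
        if not 0 <= v < height:
            continue                       # no rows above or below the image
        for du in range(-half, half + 1):
            u = (u0 + du) % width          # assumption: azimuth wraps around
            neighbor = points_img[v][u]
            if neighbor is None:
                continue
            candidates.append((math.dist(query, neighbor), (v, u)))
    candidates.sort(key=lambda item: item[0])
    return candidates[:k]                  # top-K by relative distance


def aggregate_features(neighbors, feature_map, temperature=1.0):
    """Weighted sum of neighbor features.

    Stand-in only: weights are softmax(-distance / temperature), not the
    learned attentive weights described for PDM.
    """
    if not neighbors:
        raise ValueError("no neighbors to aggregate")
    logits = [-dist / temperature for dist, _ in neighbors]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    first_v, first_u = neighbors[0][1]
    fused = [0.0] * len(feature_map[first_v][first_u])
    for weight, (_, (v, u)) in zip(exps, neighbors):
        for channel, value in enumerate(feature_map[v][u]):
            fused[channel] += (weight / total) * value
    return fused


# Toy 3x4 range image: the point stored at pixel (v, u) is (u, v, 0).
points_img = [[(float(u), float(v), 0.0) for u in range(4)] for v in range(3)]
feature_map = [[[float(u), float(v)] for u in range(4)] for v in range(3)]
neighbors = range_image_knn((1.0, 1.0, 0.1), (1, 1), points_img)
print(neighbors[0][1], aggregate_features(neighbors, feature_map))
```

Restricting the search to a small window around the query's pixel keeps the cost per point constant, which is why the look-up table matters: a dropped point still knows which pixel it projected to, so it can borrow features from the kept points nearby. In a trainable version, the fused vector would be passed to a classifier to produce the per-point prediction.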