Towards Semi-supervised Dual-modal Semantic Segmentation

Qiulei Dong, Jianan Li, Shuang Deng

TL;DR

PD-Net tackles semi-supervised dual-modal semantic segmentation by coupling 3D point clouds and 2D images through two parallel streams, one for standard supervision and another for pseudo-label generation. It introduces a multi-scale dual-modal fusion module with an attention-based mechanism, a consistency loss to align 3D and 2D features, and a non-parametric pseudo-label optimization to refine unlabeled data labels. Empirical results on ScanNet (and NYUv2) show PD-Net consistently outperforms semi-supervised uni-modal baselines and achieves competitive performance with fully-supervised dual-modal methods, even with limited labeled data. The approach offers a practical solution for exploiting abundant unlabeled dual-modal data in real-world 3D perception tasks.

Abstract

With the development of 3D and 2D data acquisition techniques, it has become easy to obtain point clouds and images of scenes simultaneously, which further facilitates dual-modal semantic segmentation. Most existing methods for simultaneously segmenting point clouds and images rely heavily on the quantity and quality of the labeled training data. However, massive point-wise and pixel-wise labeling procedures are time-consuming and labor-intensive. To address this issue, we propose a parallel dual-stream network to handle the semi-supervised dual-modal semantic segmentation task, called PD-Net, by jointly utilizing a small number of labeled point clouds, a large number of unlabeled point clouds, and unlabeled images. The proposed PD-Net consists of two parallel streams (called original stream and pseudo-label prediction stream). The pseudo-label prediction stream predicts the pseudo labels of unlabeled point clouds and their corresponding images. Then, the unlabeled data is sent to the original stream for self-training. Each stream contains two encoder-decoder branches for 3D and 2D data respectively. In each stream, multiple dual-modal fusion modules are explored for fusing the dual-modal features. In addition, a pseudo-label optimization module is explored to optimize the pseudo labels output by the pseudo-label prediction stream. Experimental results on two public datasets demonstrate that the proposed PD-Net not only outperforms the comparative semi-supervised methods but also achieves competitive performances with some fully-supervised methods in most cases.
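The self-training flow described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract only says that the pseudo-label optimization module refines the pseudo labels, so the cross-modal agreement rule below, the `IGNORE` marker, and the names `filter_by_agreement` and `masked_cross_entropy` are all hypothetical.

```python
import math

IGNORE = -1  # hypothetical marker for a pseudo label that has been deleted


def filter_by_agreement(coarse_labels, projected_labels, ignore=IGNORE):
    """Keep a pseudo label only where both modalities predict the same class.

    Assumption: agreement between the coarse label and the label projected
    from the other modality is used as the (non-parametric) keep/delete rule.
    """
    return [a if a == b else ignore
            for a, b in zip(coarse_labels, projected_labels)]


def masked_cross_entropy(probs, labels, ignore=IGNORE):
    """Mean cross entropy over the entries whose label is not `ignore`."""
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example: 4 unlabeled points, 3 classes.
coarse_3d = [0, 2, 1, 1]     # labels from the 3D branch of the pseudo-label stream
projected_3d = [0, 1, 1, 2]  # 2D pseudo labels projected onto the same points
pseudo = filter_by_agreement(coarse_3d, projected_3d)

# Class probabilities predicted by the original stream for the same points.
probs = [[0.7, 0.2, 0.1], [0.3, 0.3, 0.4], [0.1, 0.8, 0.1], [0.2, 0.5, 0.3]]
loss = masked_cross_entropy(probs, pseudo)
```

Only the first and third points keep their pseudo labels here, so the self-training loss averages over those two points. Labeled samples would use the same cross-entropy term with ground-truth labels instead.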

Paper Structure (18 sections, 7 equations, 6 figures, 9 tables)


Figures (6)

  • Figure 1: Diagrams of fully-supervised dual-modal segmentation (top-left), semi-supervised uni-modal segmentation (bottom-left), and our semi-supervised dual-modal segmentation (right). In each framework, boxes of the same color correspond to each other (i.e., a point cloud with its image, or an input with its output). 'Supervised' indicates that the output is supervised by the ground truth.
  • Figure 2: Architecture of the proposed PD-Net and of a single stream (the original stream or the pseudo-label prediction stream). PD-Net contains an original stream and a pseudo-label prediction stream. The Pseudo-label Optimization (PLO) module optimizes the pseudo labels output by the pseudo-label prediction stream. CE Loss denotes the cross-entropy loss. The labeled point clouds and their corresponding images are used for training only in the original stream, while the unlabeled point clouds and their corresponding images are used in both streams. Each stream contains 3D and 2D encoder-decoder branches for the dual-modal data, and multiple Dual-modal Fusion (DMF) modules that fuse the dual-modal latent features. The consistency loss constrains the dual-modal output features in the original stream.
  • Figure 3: The calculation process of the 3D fused feature $\boldsymbol{g}(p_i)$ in the dual-modal fusion module. The dimension of each key feature $K(\cdot)$, query feature $Q(\cdot)$, and value feature $V(\cdot)$ equals the dimension of its corresponding latent feature $\boldsymbol{f}(\cdot)$ divided by the head number $H$. $d_1$ and $d_2$ denote the dimensions of the 3D features and 2D features respectively. The attention-based mechanism in the dual-modal fusion module facilitates adaptively learning complementary information from dual-modal data.
  • Figure 4: The optimization process of 3D (top) and 2D (bottom) pseudo labels. The coarse 2D pseudo labels are projected onto the point clouds to obtain the projected 3D pseudo labels. The coarse 3D pseudo labels are projected onto the image plane and then densified to obtain the projected 2D pseudo labels. Black points denote pseudo labels that are deleted by the pseudo-label optimization module.
  • Figure 5: Qualitative results of point cloud segmentation on the validation set of ScanNet [2017scannet]. The segmentation results of the baseline model (MinkowskiNet18A [2019minkowskinet] and ResNet34 [2016resnet]) and our proposed PD-Net are visualized for two labeled-ratio settings (20% and 10%).
  • ...and 1 more figure
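The head-wise attention described in the Figure 3 caption can be illustrated with a small pure-Python sketch. It is not the paper's module. It assumes the 3D point feature acts as the query and the 2D pixel features act as keys and values, that both modalities have already been mapped to a shared feature dimension, and it omits the learned projections; the function names are hypothetical. What it does show is the stated dimension rule: each head works on a slice of size d / H.

```python
import math


def split_heads(vec, num_heads):
    """Split a length-d vector into num_heads chunks of length d // num_heads."""
    assert len(vec) % num_heads == 0, "feature dimension must be divisible by H"
    step = len(vec) // num_heads
    return [vec[i * step:(i + 1) * step] for i in range(num_heads)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values, num_heads):
    """Fuse one point feature (query) with several pixel features (keys/values)."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    fused = []
    for h in range(num_heads):
        d_head = len(q_heads[h])
        # Scaled dot-product score between the point and every pixel, per head.
        scores = [sum(a * b for a, b in zip(q_heads[h], k[h])) / math.sqrt(d_head)
                  for k in k_heads]
        weights = softmax(scores)
        # Weighted sum of the value slices, concatenated across heads.
        for j in range(len(v_heads[0][h])):
            fused.append(sum(w * v[h][j] for w, v in zip(weights, v_heads)))
    return fused


# Toy example: d = 4, H = 2, one point attending to two pixels.
point_feature = [1.0, 0.0, 0.0, 1.0]
pixel_keys = [[1.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0]]
pixel_values = [[2.0, 0.0, 4.0, 0.0], [0.0, 2.0, 0.0, 4.0]]
fused_feature = cross_attention(point_feature, pixel_keys, pixel_values, num_heads=2)
```

Because the two keys in the toy example are identical, each head weights the two pixels equally and the fused feature is the mean of the value vectors. With distinct keys, the weights shift toward the pixel whose key best matches the point feature.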