Towards Semi-supervised Dual-modal Semantic Segmentation
Qiulei Dong, Jianan Li, Shuang Deng
TL;DR
PD-Net tackles semi-supervised dual-modal semantic segmentation by coupling 3D point clouds and 2D images through two parallel streams, one trained with standard supervision and the other used to generate pseudo labels. It introduces a multi-scale dual-modal fusion module with an attention-based mechanism, a consistency loss that aligns 3D and 2D features, and a non-parametric pseudo-label optimization step that refines the pseudo labels assigned to unlabeled data. Empirical results on ScanNet (and NYUv2) show that PD-Net consistently outperforms semi-supervised uni-modal baselines and is competitive with fully-supervised dual-modal methods, even with limited labeled data. The approach offers a practical way to exploit abundant unlabeled dual-modal data in real-world 3D perception tasks.
Abstract
With the development of 3D and 2D data acquisition techniques, it has become easy to obtain point clouds and images of scenes simultaneously, which further facilitates dual-modal semantic segmentation. Most existing methods for simultaneously segmenting point clouds and images rely heavily on the quantity and quality of the labeled training data. However, point-wise and pixel-wise labeling at scale is time-consuming and labor-intensive. To address this issue, we propose a parallel dual-stream network, called PD-Net, for the semi-supervised dual-modal semantic segmentation task. PD-Net jointly utilizes a small number of labeled point clouds, a large number of unlabeled point clouds, and unlabeled images. It consists of two parallel streams, called the original stream and the pseudo-label prediction stream. The pseudo-label prediction stream predicts pseudo labels for the unlabeled point clouds and their corresponding images. The unlabeled data, together with these pseudo labels, is then sent to the original stream for self-training. Each stream contains two encoder-decoder branches, one for 3D data and one for 2D data. In each stream, multiple dual-modal fusion modules are used to fuse the dual-modal features. In addition, a pseudo-label optimization module refines the pseudo labels output by the pseudo-label prediction stream. Experimental results on two public datasets demonstrate that the proposed PD-Net not only outperforms the comparative semi-supervised methods but also achieves competitive performance with some fully-supervised methods in most cases.
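
The self-training idea described above can be illustrated with a minimal, generic sketch. It is not PD-Net's implementation: the confidence threshold, the `ignore_index` convention, the `unlabeled_weight` factor, and all function names are assumptions chosen for illustration, and the simple thresholding here stands in for the paper's pseudo-label optimization module, whose details are not given in this abstract. One function turns per-point class scores from a pseudo-label stream into pseudo labels, and another combines a supervised loss on labeled points with a loss on the pseudo-labeled points.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one point's class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_pseudo_labels(logits_per_point, threshold=0.9, ignore_index=-1):
    """Keep the arg-max class only where the prediction is confident.

    Assumption: a fixed confidence threshold; low-confidence points get
    `ignore_index` and are excluded from the unlabeled loss.
    """
    labels = []
    for logits in logits_per_point:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append(best if probs[best] >= threshold else ignore_index)
    return labels

def cross_entropy(logits_per_point, labels, ignore_index=-1):
    """Mean cross-entropy over points whose label is not `ignore_index`."""
    losses = [
        -math.log(softmax(logits)[label])
        for logits, label in zip(logits_per_point, labels)
        if label != ignore_index
    ]
    return sum(losses) / len(losses) if losses else 0.0

def self_training_loss(labeled_logits, labeled_targets,
                       unlabeled_logits, pseudo_labels,
                       unlabeled_weight=1.0):
    """Supervised loss plus a weighted loss on pseudo-labeled points.

    Assumption: a simple weighted sum; the actual loss used by PD-Net
    may contain further terms.
    """
    supervised = cross_entropy(labeled_logits, labeled_targets)
    unsupervised = cross_entropy(unlabeled_logits, pseudo_labels)
    return supervised + unlabeled_weight * unsupervised

# Hypothetical scores from a pseudo-label stream for two unlabeled points.
stream_scores = [[5.0, 0.0, 0.0], [0.1, 0.0, 0.2]]
pseudo = make_pseudo_labels(stream_scores)
```

In this sketch the first point receives a pseudo label because its top class is predicted with high confidence, while the second, nearly uniform prediction is ignored. The same pattern would apply per pixel on the 2D branch.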
