
Instance Consistency Regularization for Semi-Supervised 3D Instance Segmentation

Yizheng Wu, Zhiyu Pan, Kewei Wang, Xingyi Li, Jiahao Cui, Liwen Xiao, Guosheng Lin, Zhiguo Cao

TL;DR

This work tackles the expensive labeling burden and the noise in semantic pseudo labels in semi-supervised 3D instance segmentation by proposing InsTeacher3D, a framework that relies purely on instance consistency. It combines a parallel kernel-based base model, DKNet, with an instance consistency regularization module that discards semantic pseudo labels and uses a dynamic mask generation process to produce high-quality instance pseudo labels under a mean-teacher EMA scheme. The approach yields state-of-the-art results on ScanNetV2, S3DIS, and STPLS3D across various annotation rates, highlighting the value of instance-centric learning for robust semi-supervised 3D segmentation. Overall, the study demonstrates that focusing on cohesive instance knowledge, rather than noisy semantic predictions, substantially improves learning from unlabeled data and enables strong performance in both indoor and outdoor scenes, with practical implications for scalable 3D scene understanding. Setting $\alpha=0.999$ in the EMA update is found to provide stable pseudo labels and performance gains.
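The mean-teacher EMA scheme mentioned above can be illustrated with a minimal sketch. It assumes the standard update $\theta^{t} \leftarrow \alpha\,\theta^{t} + (1-\alpha)\,\theta^{s}$, where the teacher weights track a running average of the student weights. The function name `ema_update` and the dict-of-lists parameter layout are hypothetical stand-ins for real network tensors and are not taken from the InsTeacher3D code.

```python
def ema_update(teacher, student, alpha=0.999):
    """Return the teacher weights after one mean-teacher EMA step.

    teacher, student: dicts mapping a parameter name to a list of floats.
    This layout is an illustrative assumption, not the authors' data structure.
    """
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


# Toy example: the teacher moves only slightly toward the student per step.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
teacher = ema_update(teacher, student)
```

With $\alpha$ close to 1, each step changes the teacher very little, which is one plausible reason a value such as 0.999 gives stable pseudo labels.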

Abstract

Large-scale datasets with point-wise semantic and instance labels are crucial to 3D instance segmentation but also expensive. To leverage unlabeled data, previous semi-supervised 3D instance segmentation approaches have explored self-training frameworks, which rely on high-quality pseudo labels for consistency regularization. They intuitively utilize both instance and semantic pseudo labels in a joint learning manner. However, semantic pseudo labels contain considerable noise derived from the imbalanced category distribution and the natural confusion of similar but distinct categories, which leads to severe collapses in self-training. Motivated by the observation that 3D instances are non-overlapping and spatially separable, we ask whether we can solely rely on instance consistency regularization for improved semi-supervised segmentation. To this end, we propose a novel self-training network, InsTeacher3D, to explore and exploit pure instance knowledge from unlabeled data. We first build a parallel base 3D instance segmentation model, DKNet, which distinguishes each instance from the others via discriminative instance kernels without reliance on semantic segmentation. Based on DKNet, we further design a novel instance consistency regularization framework to generate and leverage high-quality instance pseudo labels. Experimental results on multiple large-scale datasets show that InsTeacher3D significantly outperforms prior state-of-the-art semi-supervised approaches. Code is available at https://github.com/W1zheng/InsTeacher3D.

Paper Structure (32 sections, 11 equations, 10 figures, 13 tables, 1 algorithm)


Figures (10)

  • Figure 1: The comparison between semantic and instance pseudo labels. (a) An armchair combines features of both a sofa and a chair. This confusion, along with the imbalance in semantic categories, results in poor semantic pseudo labels. In contrast, the non-overlapping and spatially separable nature of 3D instances leads to accurate instance pseudo labels. We present the confidence scores within parentheses. (b) The semantic pseudo labels may shatter the instances into multiple semantic parts, which causes ambiguity during the self-training procedure. (c) Instance pseudo labels are sharp enough and keep instances as cohesive units, which benefits learning from unlabeled data.
  • Figure 2: The pipeline of InsTeacher3D. InsTeacher3D is a self-training network built on an instance consistency regularization framework, where DKNet is the base segmentation model. "DMG" denotes the dynamic mask generation module, a key module in instance consistency regularization that generates high-quality instance pseudo labels.
  • Figure 3: Different architectures for consistency regularization in semi-supervised 3D instance segmentation networks. Here, we focus on the pivotal aspects of consistency regularization on unlabeled data for clarity. The joint consistency regularization (JCR) framework can cooperate with serial and parallel base segmentation models in (a) and (b), respectively. "I" and "S" represent the instance and semantic segmentation modules. The teacher and student models are denoted by superscripts "t" and "s". "Ins." and "Sem." are instance and semantic pseudo labels, respectively. The red and blue colors represent the knowledge of semantic and instance segmentation. Best viewed in color.
  • Figure 4: The performance of different instance segmentation architectures and the training curves on S3DIS. We report semantic and instance segmentation performance in (a) and (b). HAIS and DKNet are adopted as the serial and parallel base models, respectively. $100\%$ and $20\%$ denote the two supervised settings, where models are trained with different amounts of labeled data. "LR ($20\%$)" is the semi-supervised setting where we train the model with labels from only $20\%$ of the scenes. The curves in (c) and (d) represent the loss terms on the S3DIS validation set with the joint and instance consistency regularization frameworks, respectively.
  • Figure 5: A detailed procedure for the instance self-enhancement module. The simply projected masks contain considerable noise, which leads to unstable self-training. The proposed instance self-enhancement module effectively improves the quality of pseudo labels $\hat{M}$.
  • ...and 5 more figures