Bayesian Self-Training for Semi-Supervised 3D Segmentation
Ozan Unal, Christos Sakaridis, Luc Van Gool
TL;DR
This paper introduces Bayesian self-training for semi-supervised 3D perception. It uses dropout-based Monte Carlo inference to estimate point-wise predictive uncertainty and filters pseudo-labels by entropy. The same building blocks cover semantic segmentation, instance segmentation, and dense 3D visual grounding, with a heuristic n-partite matching strategy (Hungarian) aligning instance predictions across stochastic passes. The approach achieves state-of-the-art semi-supervised results on SemanticKITTI and ScribbleKITTI for semantic segmentation and on ScanNet and S3DIS for instance segmentation. On ScanRefer, it substantially improves dense 3D visual grounding over supervised-only baselines, including gains when verbal prompts are available for unlabeled data. The method is simple to implement and applies broadly to dense 3D tasks with partial labels.
Abstract
3D segmentation is a core problem in computer vision and, like many other dense prediction tasks, it requires large amounts of annotated data for adequate training. However, densely labeling 3D point clouds for fully supervised training remains too labor-intensive and expensive. Semi-supervised training provides a more practical alternative, where only a small set of labeled data is given, accompanied by a larger unlabeled set. This area thus studies the effective use of unlabeled data to reduce the performance gap that arises due to the lack of annotations. In this work, inspired by Bayesian deep learning, we first propose a Bayesian self-training framework for semi-supervised 3D semantic segmentation. Employing stochastic inference, we generate an initial set of pseudo-labels and then filter these based on estimated point-wise uncertainty. By constructing a heuristic $n$-partite matching algorithm, we extend the method to semi-supervised 3D instance segmentation, and finally, with the same building blocks, to dense 3D visual grounding. We demonstrate state-of-the-art results for our semi-supervised method on SemanticKITTI and ScribbleKITTI for 3D semantic segmentation and on ScanNet and S3DIS for 3D instance segmentation. We further achieve substantial improvements in dense 3D visual grounding over supervised-only baselines on ScanRefer. Our project page is available at ouenal.github.io/bst/.
