BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation

Jiahao Lu; Jiacheng Deng; Tianzhu Zhang

TL;DR

This work tackles 3D instance segmentation (3DIS) when only bounding box annotations are available, where points inside overlapping boxes have ambiguous instance labels. It introduces SAFormer (Simulation-assisted Transformer), a pseudo-labeler that combines a Simulation-assisted Mean Teacher with a Local-Global Aware Attention decoder and uses simulated overlapping samples to bootstrap learning. With the Mean Teacher framework and soft supervision, BSNet produces high-quality pseudo-labels in overlapping areas and uses them to train a 3DIS network, yielding state-of-the-art results on ScanNetV2 and S3DIS while remaining computationally efficient. The approach significantly narrows the gap to fully supervised performance and offers a practical path for box-supervised 3D scene understanding.
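The label ambiguity described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes axis-aligned boxes, and the names `box_labels`, `AMBIGUOUS`, and `BACKGROUND` are hypothetical. A point inside exactly one box takes that box's instance index, while a point inside two or more boxes has no determined label. Treating every point inside a single box as part of that instance is itself a simplification.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (min corner, max corner); axis-aligned by assumption

AMBIGUOUS = -1   # hypothetical marker: point lies in two or more boxes
BACKGROUND = -2  # hypothetical marker: point lies in no box


def point_in_box(p: Point, box: Box) -> bool:
    """Return True if p lies inside the closed axis-aligned box."""
    lo, hi = box
    return all(lo[i] <= p[i] <= hi[i] for i in range(3))


def box_labels(points: Sequence[Point], boxes: Sequence[Box]) -> List[int]:
    """Assign each point the index of its box, or a marker if undetermined."""
    labels = []
    for p in points:
        hits = [k for k, b in enumerate(boxes) if point_in_box(p, b)]
        if len(hits) == 1:
            labels.append(hits[0])      # unique box: label is determined
        elif hits:
            labels.append(AMBIGUOUS)    # overlapping area: label is undetermined
        else:
            labels.append(BACKGROUND)
    return labels


# Two boxes that intersect in the region [1, 2] x [1, 2] x [0, 2].
boxes = [((0.0, 0.0, 0.0), (2.0, 2.0, 2.0)),
         ((1.0, 1.0, 0.0), (3.0, 3.0, 2.0))]
points = [(0.5, 0.5, 1.0), (1.5, 1.5, 1.0), (2.5, 2.5, 1.0), (5.0, 5.0, 5.0)]
labels = box_labels(points, boxes)
print(labels)  # [0, -1, 1, -2]
```

The second point falls in the intersection of both boxes, which is exactly the kind of point a pseudo-labeler has to resolve.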

Abstract

3D instance segmentation (3DIS) is a crucial task, but point-level annotations are tedious to produce in fully supervised settings. Using bounding boxes (bboxes) as annotations has therefore shown great potential. The current mainstream approach is a two-step process: generating pseudo-labels from box annotations, then training a 3DIS network with the pseudo-labels. However, because bboxes intersect, not every point has a determined instance label, especially in overlapping areas. To generate higher-quality pseudo-labels and achieve more precise weakly supervised 3DIS results, we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation (BSNet), which devises a novel pseudo-labeler called Simulation-assisted Transformer. The labeler consists of two main components. The first is Simulation-assisted Mean Teacher, which introduces Mean Teacher for the first time in this task and constructs simulated samples to assist the labeler in acquiring prior knowledge about overlapping areas. The second is Local-Global Aware Attention, which we propose as the decoder for the teacher and student labelers to better model local-global structure. Extensive experiments conducted on the ScanNetV2 and S3DIS datasets verify the superiority of our designs. Code is available at https://github.com/peoplelu/BSNet.
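For readers unfamiliar with Mean Teacher, the general technique keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The sketch below shows only that generic update. The abstract does not state the decay value or how BSNet schedules it, so `decay`, the `Params` container, and `ema_update` are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical parameter container: name -> flat list of weights.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, decay: float = 0.99) -> None:
    """In place: teacher <- decay * teacher + (1 - decay) * student."""
    for name, s_vals in student.items():
        t_vals = teacher[name]
        for i, s in enumerate(s_vals):
            t_vals[i] = decay * t_vals[i] + (1.0 - decay) * s


teacher = {"w": [1.0, 1.0]}
student = {"w": [0.0, 2.0]}

# One step with decay=0.9 moves the teacher 10% of the way to the student,
# giving approximately [0.9, 1.1].
ema_update(teacher, student, decay=0.9)
print(teacher["w"])
```

Because the teacher changes slowly, its predictions are typically more stable than the student's, which is why Mean Teacher setups commonly use the teacher's output as the supervision target.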

Paper Structure (16 sections, 11 equations, 6 figures, 10 tables)

Introduction
Related Work
Method
Overview
Process to Generate SAFormer
Simulated Sample Generation
Local-Global Aware Attention
Mean Teacher Approach
Training a 3DIS Network
Experiments
Experimental Setup
Comparison with state-of-the-art methods
Ablation Study
Parameters and Training Time Analysis
Conclusion
...and 1 more section

Figures (6)

Figure 1: (a) The visualization of an overlapping sample. (b) The proposed Simulation-assisted Mean Teacher helps the labeler acquire prior knowledge from simulated samples. (c) Our method improves local-global structure modeling of overlapping samples to generate better pseudo-labels (especially in yellow circles).
Figure 2: (i) The overall framework of our method BSNet. (ii) The total process to generate an outstanding pseudo-labeler SAFormer. BSNet is a novel two-step method consisting of generating pseudo instance labels by SAFormer and using the pseudo instance labels to train a 3D instance segmentation network.
Figure 3: The process of generating simulated samples. There are numerous non-overlapping objects ($P$) with definite instance labels in real scenes. We can generate simulated samples ($S$) based on the distribution of real overlapping samples ($O$) and the physical plausibility.
Figure 4: The Local-Global Aware Attention. Two foreground queries are input into local-structure attention and global-context attention to generate corresponding masks. $S_1, S_2$ represent non-overlapping areas. $S_3$ represents overlapping areas.
Figure 5: Qualitative results on ScanNetV2 training set. Our approach produces highly accurate pseudo instance masks, particularly in overlapping areas (yellow circles).
...and 1 more figure
