DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation

Xuexun Liu, Xiaoxu Xu, Qiudan Zhang, Lin Ma, Xu Wang

TL;DR

This work tackles weakly supervised 3D semantic instance segmentation using only scene-level labels. It proposes DBGroup, a two-stage framework whose Dual-Branch Point Grouping module combines semantic cues from a vision-language model with mask cues from SAM prompts to generate pseudo instance and semantic labels. Two refinement strategies, Granularity-Aware Instance Merging and Semantic Selection and Propagation, improve the pseudo labels, which then supervise multi-round self-training of an instance segmentation network; an Instance Mask Filter addresses inconsistencies within the pseudo labels. Experiments on ScanNetV2 and S3DIS show performance competitive with sparse-point-level supervised instance segmentation methods and better than state-of-the-art scene-level supervised semantic segmentation approaches, suggesting that scene-level annotation is a scalable alternative to denser labels.

Abstract

Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs of fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling effort. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training of an end-to-end instance segmentation network using the refined pseudo labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.

Paper Structure

This paper contains 20 sections, 6 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison between conventional weak annotation formats and our proposed scene-level annotation in the 3D instance segmentation task.
  • Figure 2: Overview of our weakly supervised 3D segmentation pipeline. It includes a Pseudo Label Generation and Refinement module, as well as a multi-round self-training 3D instance segmentation network.
  • Figure 3: The workflow of our Pseudo Label Generation and Refinement module. We propose a dual-branch point grouping architecture and two pseudo label refinement strategies. The semantic branch (SGB) extracts features from multi-view images via a pre-trained 2D model, projects them onto the 3D point cloud, computes similarities with text encoder features, and applies BFS clustering to generate coarse-grained masks. The mask branch (MGB) projects superpoints onto multi-view images as SAM prompts, and the resulting 2D masks group 3D points into fine-grained masks. Granularity-Aware Instance Merging (GAIM) merges or splits masks to produce instance pseudo labels, while Semantic Selection and Propagation (SSP) filters semantic scores to produce semantic pseudo labels.
  • Figure 4: Framework of our 3D instance segmentation network. Following a grouping-based paradigm, we use a 3D U-Net for feature extraction with semantic and offset branches to predict semantic labels and offsets. After clustering, an Instance Mask Filter removes irrelevant points from proposals, with final predictions obtained via a scoring network and non-maximum suppression (NMS).
  • Figure 5: Qualitative ablation results of SGB, MGB, and GAIM. From left to right: input point clouds, ground truth, and settings (a) to (d) in the GAIM ablation table.
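
The semantic branch described in the Figure 3 caption (similarity to text features, then BFS clustering) can be illustrated with a small sketch. This is not the authors' implementation: the function names `assign_labels` and `bfs_group`, the similarity threshold `min_sim`, the connectivity `radius`, and the toy features are all assumptions made for illustration, and the actual method may group superpoints or use a different neighborhood definition.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def assign_labels(point_feats, text_feats, min_sim):
    """Give each point the index of its most similar text feature.

    Points whose best similarity is below min_sim (a hypothetical
    threshold) are left unlabeled (None).
    """
    labels = []
    for feat in point_feats:
        sims = [cosine(feat, text) for text in text_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(best if sims[best] >= min_sim else None)
    return labels


def bfs_group(points, labels, radius):
    """Breadth-first grouping of labeled points into coarse masks.

    Two points are connected when they share a label and lie within
    `radius` of each other (an assumed connectivity rule). Returns one
    group id per point, or -1 for unlabeled points. The neighbor search
    is brute force, which is only suitable for a toy example.
    """
    n = len(points)
    group = [-1] * n
    next_id = 0
    for seed in range(n):
        if group[seed] != -1 or labels[seed] is None:
            continue
        group[seed] = next_id
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            for j in range(n):
                if (
                    group[j] == -1
                    and labels[j] == labels[i]
                    and math.dist(points[i], points[j]) <= radius
                ):
                    group[j] = next_id
                    queue.append(j)
        next_id += 1
    return group


# Toy scene: two separated clusters of "chair"-like points and one "table"-like point.
points = [(0.0, 0, 0), (0.1, 0, 0), (0.2, 0, 0), (2.0, 0, 0), (2.1, 0, 0), (0.3, 0, 0)]
point_feats = [[1, 0], [1, 0], [1, 0], [0.9, 0.1], [0.9, 0.1], [0, 1]]
text_feats = [[1, 0], [0, 1]]  # stand-ins for text embeddings of "chair" and "table"

labels = assign_labels(point_feats, text_feats, min_sim=0.5)
groups = bfs_group(points, labels, radius=0.15)
print(labels, groups)
```

In this toy scene the two chair clusters share a semantic label but end up in different groups because no chain of nearby same-label points connects them, which is the behavior that lets a clustering step turn per-point semantics into coarse instance masks.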