Late-decoupled 3D Hierarchical Semantic Segmentation with Semantic Prototype Discrimination based Bi-branch Supervision

Shuyu Cao, Chongshou Li, Jie Xu, Tianrui Li, Na Zhao

TL;DR

This work addresses 3D hierarchical semantic segmentation (3DHS), a task relevant to embodied intelligence applications, by tackling two persistent challenges: cross-hierarchy conflicts that arise when all hierarchies are learned by a parameter-sharing model, and inherent class imbalance across hierarchy levels. It proposes a Late-decoupled 3DHS (Ld-3DHS) framework in which a shared encoder feeds a separate decoder per hierarchy, augmented with a coarse-to-fine guidance mechanism and a cross-hierarchical consistency loss. An auxiliary discrimination branch learns class-wise discriminative features via supervised contrastive learning and mutual semantic-prototype supervision. The total objective is $\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{late-3DHS}}+\lambda\sum_{h=1}^{H}\mathcal{L}_{\text{aux}}^{(h)}$, where $H$ is the number of hierarchies and $\lambda$ weights the auxiliary discrimination branch loss; the auxiliary terms are intended to improve minority-class segmentation. Experiments on Campus3D, S3DIS-H, and SensatUrban-H show state-of-the-art performance across backbones and datasets, and the core components can serve as a plug-and-play enhancement to existing 3DHS methods.
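
To make the structure of the total objective concrete, the following minimal sketch combines one primary loss value with per-hierarchy auxiliary loss values. The function name `total_loss` and the example numbers are hypothetical; only the form of the sum comes from the summary above, and the individual loss terms are treated as already-computed scalars.

```python
from typing import Sequence


def total_loss(late_3dhs_loss: float,
               aux_losses: Sequence[float],
               lam: float) -> float:
    """Combine losses as L_total = L_late-3DHS + lambda * sum_h L_aux^(h).

    late_3dhs_loss: scalar loss of the primary 3DHS branch.
    aux_losses: one auxiliary-branch loss per hierarchy (length H).
    lam: weight of the auxiliary discrimination branch (lambda).
    """
    return late_3dhs_loss + lam * sum(aux_losses)


if __name__ == "__main__":
    # Hypothetical values for a two-hierarchy setting (H = 2).
    print(total_loss(1.0, [0.5, 0.25], lam=0.1))
```

With `lam = 0` the objective reduces to the primary branch loss alone, which is one way to see the auxiliary branch as an optional add-on.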

Abstract

3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand a multi-grained and multi-hierarchy understanding of 3D scenes. Despite recent progress, previous 3DHS methods have overlooked the following two challenges: I) multi-label learning with a parameter-sharing model can lead to multi-hierarchy conflicts in cross-hierarchy optimization, and II) class imbalance is inevitable across the multiple hierarchies of 3D scenes, which causes model performance to be dominated by the majority classes. To address these issues, we propose a novel framework with a primary 3DHS branch and an auxiliary discrimination branch. Specifically, to alleviate the multi-hierarchy conflicts, we propose a late-decoupled 3DHS framework that employs multiple decoders with coarse-to-fine hierarchical guidance and consistency. The late-decoupled architecture can mitigate the underfitting and overfitting conflicts among multiple hierarchies and can also confine the class imbalance problem to each individual hierarchy. Moreover, we introduce a 3DHS-oriented, semantic-prototype-based bi-branch supervision mechanism, which additionally learns class-wise discriminative point cloud features and performs mutual supervision between the auxiliary and 3DHS branches, to improve segmentation under class imbalance. Extensive experiments on multiple datasets and backbones demonstrate that our approach achieves state-of-the-art 3DHS performance, and that its core components can also be used as a plug-and-play enhancement to improve previous methods.
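
The abstract does not spell out how hierarchical consistency is enforced, so the following is only an illustrative sketch of what consistency between a coarse and a fine hierarchy means. It measures, for per-point predictions, how often the parent of the predicted fine class disagrees with the predicted coarse class. The function name, the `parent_of` mapping, and the class ids are all hypothetical, and this is a diagnostic metric rather than the paper's loss.

```python
from typing import Mapping, Sequence


def hierarchy_inconsistency_rate(coarse_pred: Sequence[int],
                                 fine_pred: Sequence[int],
                                 parent_of: Mapping[int, int]) -> float:
    """Fraction of points whose fine label does not roll up to the coarse label.

    coarse_pred: predicted coarse-level class id per point.
    fine_pred: predicted fine-level class id per point.
    parent_of: maps each fine class id to its coarse parent class id.
    """
    if len(coarse_pred) != len(fine_pred):
        raise ValueError("coarse_pred and fine_pred must have the same length")
    if not coarse_pred:
        return 0.0
    mismatches = sum(
        1 for c, f in zip(coarse_pred, fine_pred) if parent_of[f] != c
    )
    return mismatches / len(coarse_pred)


if __name__ == "__main__":
    # Hypothetical two-level taxonomy: fine classes 0 and 1 belong to
    # coarse class 0, fine class 2 belongs to coarse class 1.
    parent_of = {0: 0, 1: 0, 2: 1}
    coarse = [0, 0, 1, 1]
    fine = [0, 1, 2, 0]  # the last point violates the hierarchy
    print(hierarchy_inconsistency_rate(coarse, fine, parent_of))
```

When each hierarchy has its own decoder, nothing forces the outputs to agree by construction, which is presumably why an explicit guidance and consistency mechanism is part of the design.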

Paper Structure

This paper contains 15 sections, 16 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Overview of Ld-3DHS. (a) The late-decoupled 3DHS branch uses a shared point cloud encoder with multiple late-decoupled decoders to perform the hierarchical segmentation tasks, and leverages a coarse-to-fine guidance mechanism to balance task-specific learning and hierarchical consistency. (b) The auxiliary discrimination branch introduces a novel semantic-prototype-based bi-branch supervision scheme, which employs contrastive learning to learn discriminative representations for individual classes and then exchanges bi-directional supervision with the 3DHS branch, thereby improving segmentation on class-imbalanced point clouds.
  • Figure 2: Per-class segmentation performance of three methods on the S3DIS-H dataset. The mIoU values of the L1 hierarchy are reported, and the black line shows the number of points per class on a log scale.
  • Figure 3: Time cost comparison of three 3DHS methods.
  • Figure 4: Qualitative comparison of three 3D hierarchical semantic segmentation methods (MTHS [li2020campus3d], DHL [li2025deep_hierarchical_learning], and our Ld-3DHS) on test samples from S3DIS-H. L0 and L1 denote the two semantic hierarchies in the indoor point cloud scenes.
  • Figure 5: Hyper-parameter analysis for (a) coarse-to-fine guidance weight $\alpha$ and (b) auxiliary discrimination branch loss weight $\lambda$.
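
The caption of Figure 1(b) refers to semantic prototypes without defining them here. A common reading, assumed in this sketch and not confirmed by the text above, is that a prototype is the mean feature vector of all points sharing a class label. The function name `class_prototypes` and the toy features are hypothetical.

```python
from typing import Dict, List, Sequence


def class_prototypes(features: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> Dict[int, List[float]]:
    """Return one prototype per class: the mean of that class's features.

    features: per-point feature vectors, all of the same dimension.
    labels: class id per point, aligned with `features`.
    Assumption: prototype = class-wise mean feature (illustrative only).
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for feat, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(feat)
            counts[lab] = 0
        for i, value in enumerate(feat):
            sums[lab][i] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}


if __name__ == "__main__":
    # Class 0 has two points and class 1 has one; each still gets
    # exactly one prototype.
    feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
    labs = [0, 0, 1]
    print(class_prototypes(feats, labs))
```

Under this reading, every class contributes a single prototype regardless of how many points it has, which is one plausible reason prototype-level supervision could help minority classes.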