SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation

Lei Yao, Yi Wang, Moyun Liu, Lap-Pui Chau

TL;DR

SGIFormer tackles 3D point-cloud instance segmentation by addressing two core issues: initialization of instance queries and preservation of geometric details in deep decoding. It introduces a semantic-guided mix query (SMQ) to generate scene-aware queries from voxel-wise semantic predictions and combines them with learnable queries, forming a diverse and informed query set. A geometric-enhanced interleaving transformer (GIT) decoder then refines instance queries and global scene features in an alternating fashion, aided by a bias-based geometric auxiliary task and shifted coordinate embeddings. Across ScanNet V2, ScanNet200, and ScanNet++, SGIFormer achieves state-of-the-art performance with competitive efficiency, with ablations confirming the effectiveness of semantic guidance, geometric refinement, and interleaving updates for preserving fine-grained details in large-scale scenes.

Abstract

In recent years, transformer-based models have shown considerable potential for point cloud instance segmentation. Despite their promising performance, existing methods suffer from instance query initialization problems and excessive reliance on stacked layers, which makes them ill-suited to large-scale 3D scenes. This paper introduces SGIFormer, a novel method for 3D instance segmentation composed of a Semantic-guided Mix Query (SMQ) initialization and a Geometric-enhanced Interleaving Transformer (GIT) decoder. Specifically, the SMQ initialization scheme leverages predicted voxel-wise semantic information to implicitly generate scene-aware queries, providing an adequate scene prior that complements the learnable query set. The combined query set is then fed into the GIT decoder, which alternately refines instance queries and global scene features to capture fine-grained information while reducing design complexity. To emphasize geometric properties, we treat bias estimation as an auxiliary task and progressively integrate shifted point coordinate embeddings to reinforce instance localization. SGIFormer attains state-of-the-art performance on the ScanNet V2 and ScanNet200 datasets and on the challenging high-fidelity ScanNet++ benchmark, striking a balance between accuracy and efficiency. The code, weights, and demo videos are publicly available at https://rayyoh.github.io/sgiformer.
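
The mix-query idea in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract only says that scene-aware queries are generated "implicitly" from voxel-wise semantic predictions, so the top-k foreground selection, the linear projection `proj`, the function name `semantic_guided_mix_query`, and the choice of background class ids are all assumptions made here for illustration.

```python
import numpy as np

def semantic_guided_mix_query(voxel_feats, sem_logits, learnable_queries,
                              proj, num_scene_queries, background_ids=(0, 1)):
    """Build a mixed query set (illustrative sketch, not the paper's code).

    voxel_feats:       (N, C) voxel-wise backbone features
    sem_logits:        (N, K) voxel-wise semantic predictions
    learnable_queries: (Q, D) scene-independent learnable queries
    proj:              (C, D) hypothetical projection into query space
    background_ids:    class ids treated as background (an assumption)
    """
    # Numerically stable softmax over the semantic classes.
    shifted = sem_logits - sem_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    # Foreground score: probability mass not assigned to background classes.
    fg_score = 1.0 - probs[:, list(background_ids)].sum(axis=1)

    # Keep the k voxels that look most like foreground (assumed selection rule).
    k = min(num_scene_queries, voxel_feats.shape[0])
    top_idx = np.argsort(-fg_score)[:k]

    # Scene-aware queries come from the selected voxel features.
    scene_queries = voxel_feats[top_idx] @ proj  # (k, D)

    # Mix: scene-aware queries first, learnable queries after.
    mixed = np.concatenate([scene_queries, learnable_queries], axis=0)
    return mixed, top_idx
```

The point of the mix is that the first block of queries depends on the scene content while the second block is shared across scenes, so the decoder starts from a set that is both informed and diverse.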

Paper Structure

This paper contains 17 sections, 13 equations, 5 figures, 10 tables, 1 algorithm.

Figures (5)

  • Figure 1: Performance evaluation of our proposed SGIFormer. (a) Performance comparison of various methods in terms of AP$_{50}$ and model size on the ScanNet V2 (Dai et al., 2017) validation split. SGIFormer outperforms previous methods, and even the smaller version achieves competitive results. (b) Fine-grained segmentation results of SGIFormer on the ScanNet++ (Yeshwanth et al., 2023) validation set, demonstrating its ability to accurately segment small objects within large-scale scenes.
  • Figure 2: Overall architecture of SGIFormer. The method comprises three main components: a symmetrical U-Net backbone, a Semantic-guided Mix Query (SMQ) initialization scheme, and a Geometric-enhanced Interleaving Transformer (GIT) decoder. In the figure, Aug. denotes data augmentation. The 3D backbone extracts voxel-wise global features ${\mathbf{F}}$ from the input point cloud ${\mathcal{P}}$ (see the backbone section). SMQ constructs instance queries ${\mathcal{Q}}$ with semantic guidance (see the query section). GIT alternately refines the queries and scene features to enhance geometric information and capture fine-grained details. The final instance masks ${\mathcal{M}}$ and categories $\boldsymbol{p}$ are predicted by the decoder (see the decoder section).
  • Figure 3: Geometric-enhanced Interleaving Transformer (GIT) decoder. The diagram illustrates the detailed structure of the decoder, which consists of $L$ layers and employs an alternating update scheme to capture fine-grained features. In each layer, the instance queries ${\mathcal{Q}}$ and scene features ${\mathbf{F}}_{\texttt{s}}$ are iteratively refined by incorporating the shifted coordinates embedding ${\mathbf{E}}_{\texttt{s}}$. The refined instance queries are then used to predict masks ${\mathcal{M}}$ and categories $\boldsymbol{p}$.
  • Figure 4: Visual comparison on the ScanNet V2 validation split. We visualize the instance segmentation results of SGIFormer (ours), SPFormer (Sun et al., 2023), and Spherical Mask (Shin et al., 2023). Inst. GT denotes the instance ground truth, and different colors indicate different instance IDs. Differences from SPFormer are highlighted in red, and differences from Spherical Mask are highlighted in blue.
  • Figure 5: Qualitative results on the ScanNet++ validation set. We present four representative examples from the ScanNet++ validation set, showing the input point clouds, the instance ground truth, and the segmentation results of SGIFormer. The visualization illustrates the method's capability in handling large-scale, high-fidelity scenes.
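
The alternating update described in the Figure 3 caption can be sketched in a few lines. This is a simplified reading rather than the paper's decoder: it uses single-head attention without learned query/key/value projections, normalization, or feed-forward blocks, and the names `cross_attend`, `interleaved_decoder`, and `pos_proj` are hypothetical. The only structure taken from the text is that, in each layer, queries are refined from scene features carrying a shifted-coordinate embedding, and scene features are then refined from the updated queries.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(target, source):
    """Single-head attention with a residual: target rows attend to source rows."""
    scores = target @ source.T / np.sqrt(target.shape[-1])
    return target + softmax(scores, axis=-1) @ source

def interleaved_decoder(queries, scene_feats, coords, offsets, pos_proj,
                        num_layers=3):
    """Alternately refine queries (Q, D) and scene features (N, D).

    coords:   (N, 3) point coordinates
    offsets:  (N, 3) predicted per-point bias (auxiliary-task output, assumed given)
    pos_proj: (3, D) hypothetical linear coordinate embedding
    """
    # Shifted coordinates: each point moved by its predicted bias.
    coord_embed = (coords + offsets) @ pos_proj  # (N, D)

    masks = []
    for _ in range(num_layers):
        # Step 1: queries gather information from geometry-aware scene features.
        queries = cross_attend(queries, scene_feats + coord_embed)
        # Step 2: scene features are refined using the updated queries.
        scene_feats = cross_attend(scene_feats, queries)
        # Per-layer binary masks from query/feature similarity
        # (a logit above 0 corresponds to a sigmoid above 0.5).
        masks.append(queries @ scene_feats.T > 0)
    return queries, scene_feats, masks
```

Compared with a decoder that only updates queries against fixed scene features, this interleaving lets the scene representation change from layer to layer, which is the property the caption attributes to the alternating scheme.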