Mask3D: Mask Transformer for 3D Semantic Instance Segmentation

Jonas Schult; Francis Engelmann; Alexander Hermans; Or Litany; Siyu Tang; Bastian Leibe

TL;DR

Mask3D introduces a Transformer-based approach for 3D semantic instance segmentation that avoids the voting schemes and hand-tuned geometric clustering used by earlier methods. It learns instance queries that attend to multi-scale point features and predicts all instance masks in parallel via a mask module, trained with a bipartite matching loss. The method achieves state-of-the-art results on ScanNet, ScanNet200, S3DIS, and STPLS3D, with ablations supporting the use of non-parametric queries, the Dice+BCE loss, and the chosen number of queries and decoder layers. It demonstrates the viability of Transformer-based architectures for 3D instance segmentation and reduces reliance on hand-crafted geometric priors.
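The bipartite matching step mentioned in the TL;DR can be illustrated with a small sketch. It pairs predicted heatmaps with ground-truth masks so that the summed Dice+BCE cost is minimal. Everything below is an assumption made for illustration: the function names, the equal loss weights `w_bce` and `w_dice`, the smoothing constant, and the brute-force search over permutations (practical only for toy sizes; an assignment solver such as the Hungarian algorithm would normally be used). It also leaves out any classification term in the matching cost.

```python
import math
from itertools import permutations

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between a heatmap and a binary mask."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)

def dice(pred, target, smooth=1.0):
    """Soft Dice loss with additive smoothing (smooth value is an assumption)."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)

def match(pred_heatmaps, gt_masks, w_bce=1.0, w_dice=1.0):
    """Return (prediction_index, ground_truth_index) pairs of minimum total cost.

    Assumes there are at least as many predictions as ground-truth masks.
    The weights are hypothetical, not values taken from the paper.
    """
    cost = [[w_bce * bce(p, g) + w_dice * dice(p, g) for g in gt_masks]
            for p in pred_heatmaps]
    best, best_cost = None, float("inf")
    # Each permutation picks a distinct prediction for every ground-truth mask.
    for perm in permutations(range(len(pred_heatmaps)), len(gt_masks)):
        c = sum(cost[p][g] for g, p in enumerate(perm))
        if c < best_cost:
            best, best_cost = perm, c
    return [(p, g) for g, p in enumerate(best)]

# Toy data (hypothetical): 3 predicted heatmaps over 4 points, 2 ground-truth masks.
pred_heatmaps = [
    [0.1, 0.2, 0.9, 0.8],
    [0.9, 0.8, 0.1, 0.2],
    [0.5, 0.5, 0.5, 0.5],
]
gt_masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
pairs = match(pred_heatmaps, gt_masks)
print(pairs)
```

In this toy case the uninformative third prediction is left unmatched, while each ground-truth mask is paired with the heatmap that overlaps it best.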

Abstract

Modern 3D semantic instance segmentation approaches predominantly rely on specialized voting mechanisms followed by carefully designed geometric clustering techniques. Building on the successes of recent Transformer-based methods for object detection and image segmentation, we propose the first Transformer-based approach for 3D semantic instance segmentation. We show that we can leverage generic Transformer building blocks to directly predict instance masks from 3D point clouds. In our model, called Mask3D, each object instance is represented as an instance query. Using Transformer decoders, the instance queries are learned by iteratively attending to point cloud features at multiple scales. Combined with point features, the instance queries directly yield all instance masks in parallel. Mask3D has several advantages over current state-of-the-art approaches: it relies neither on (1) voting schemes, which require hand-selected geometric properties (such as centers), nor on (2) geometric grouping mechanisms, which require manually tuned hyper-parameters (e.g. radii), and (3) it enables a loss that directly optimizes instance masks. Mask3D sets a new state-of-the-art on ScanNet test (+6.2 mAP), S3DIS 6-fold (+10.1 mAP), STPLS3D (+11.2 mAP) and ScanNet200 test (+12.4 mAP).
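As a rough illustration of how instance queries and point features can jointly yield masks, the sketch below follows the description in the Figure 2 caption: a dot product between each query and each point feature, a sigmoid to obtain a heatmap, and a threshold of 0.5 to obtain a binary mask. The function names and the toy data are hypothetical, and the sketch omits the backbone, the Transformer decoder, the semantic class prediction, and any spatial rescaling.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_masks(queries, point_features, threshold=0.5):
    """Return (heatmaps, binary_masks), with one row per instance query."""
    heatmaps, masks = [], []
    for q in queries:
        # Dot product between the query and every point feature, then sigmoid.
        row = [sigmoid(sum(qi * fi for qi, fi in zip(q, f)))
               for f in point_features]
        heatmaps.append(row)
        # Thresholding the heatmap gives the binary instance mask.
        masks.append([1 if h > threshold else 0 for h in row])
    return heatmaps, masks

# Toy data (hypothetical): 2 queries, 4 points, 2-dimensional features.
queries = [[4.0, 0.0], [0.0, 4.0]]
point_features = [[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, 1.0]]
heatmaps, masks = predict_masks(queries, point_features)
print(masks)
```

Because every query is scored against every point independently, all masks come out of a single pass rather than from a separate clustering step.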

Paper Structure (11 sections, 7 equations, 7 figures, 7 tables)

Introduction
Related Work
Method
Training and Implementation Details
Experiments
Comparing with State-of-the-Art Methods
Analysis Experiments
Qualitative Results and Limitations
Conclusion
Implementation Details
Comparison to SoftGroup

Figures (7)

Figure 1: Mask3D. We train an end-to-end model for 3D semantic instance segmentation on point clouds. Given an input 3D point cloud (left), our Transformer-based model uses an attention mechanism to produce instance heatmaps across all points (center) and directly predicts all semantic object instances in parallel (right).
Figure 2: Illustration of the Mask3D model. The feature backbone outputs multi-scale point features $\mathbf{F}$, while the Transformer decoder iteratively refines the instance queries $\mathbf{X}$. Given point features and instance queries, the mask module predicts for each query a semantic class and an instance heatmap, which (after thresholding) results in a binary instance mask $\mathbf{B}$. $\tau$ applies a threshold of 0.5 and spatially rescales if required. $\cdot$ is the dot product. $\sigma$ is the sigmoid function. We show a simplified model with fewer layers.
Figure 3: Number of queries and decoder layers.
Figure 4: Qualitative Results on ScanNet. We show pairs of predicted instance masks and predicted semantic labels. On the bottom left, we show the heatmap of a failure case of two windows that are wrongly assigned to a single instance. The corresponding point features are visualized as RGB after projecting them to 3D using PCA.
Figure 5: Illustration of the full Mask3D model. In the main paper, we showed a simplified version of our model with fewer hierarchical feature levels in the feature backbone (shown in green) and fewer query refinement layers (blue). The feature backbone outputs point features in 5 scales, while the Transformer decoder iteratively refines the instance queries. Given point features and instance queries, the mask module predicts for each query a semantic class and an instance heatmap, which (after thresholding) results in a binary instance mask.
...and 2 more figures
