Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

Yu-Qi Yang; Yu-Xiao Guo; Jian-Yu Xiong; Yang Liu; Hao Pan; Peng-Shuai Wang; Xin Tong; Baining Guo

TL;DR

Swin3D presents a scalable pretrained 3D transformer backbone for indoor scene understanding by adapting the Swin Transformer to sparse voxels. It introduces memory-efficient self-attention and a generalized contextual relative signal encoding to handle irregular point signals, enabling large-scale pretraining on synthetic Structured3D data. Empirical results show strong transfer to real datasets for semantic segmentation and 3D object detection, surpassing state-of-the-art methods and outperforming other pretrained backbones. The work highlights the importance of data scale and signal-aware attention in 3D pretraining and provides a foundation for future multimodal and self-supervised extensions.

Abstract

The use of pretrained backbones with fine-tuning has been successful for 2D vision and natural language processing tasks, showing advantages over task-specific networks. In this work, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets, but also outperforms state-of-the-art methods on downstream tasks with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validate the scalability, generality, and superior performance enabled by our approach. The code and models are available at https://github.com/microsoft/Swin3D.
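As a rough illustration of the windowed attention on sparse voxels described in the abstract, the sketch below groups occupied voxels into regular and shifted cubic windows and counts how many query-key pairs window-restricted attention would touch compared with global attention. It is not the Swin3D implementation: the helper names `window_ids` and `attention_pair_count`, the window size of 4, the shift of 2, and the random voxel grid are all assumptions made for this example, and it does not reproduce the paper's memory-efficient attention kernel.

```python
import numpy as np

def window_ids(coords, window_size, shift=0):
    """Assign each occupied voxel (integer coords, shape (N, 3)) to a cubic window.

    `shift` offsets the window grid, mimicking a shifted-window layout.
    Hypothetical helper for illustration only.
    """
    cells = (coords + shift) // window_size
    # Compact the distinct window cells into consecutive integer ids.
    _, inverse = np.unique(cells, axis=0, return_inverse=True)
    return inverse.reshape(-1)

def attention_pair_count(ids):
    """Number of query-key pairs if attention is restricted to each window."""
    counts = np.bincount(ids)
    return int((counts ** 2).sum())

# Synthetic sparse scene: distinct occupied voxels in a 32^3 grid (assumed data).
rng = np.random.default_rng(0)
coords = np.unique(rng.integers(0, 32, size=(500, 3)), axis=0)

regular = window_ids(coords, window_size=4)
shifted = window_ids(coords, window_size=4, shift=2)

pairs_regular = attention_pair_count(regular)
pairs_shifted = attention_pair_count(shifted)
pairs_global = len(coords) ** 2

print("voxels:", len(coords))
print("pairs (regular windows):", pairs_regular)
print("pairs (shifted windows):", pairs_shifted)
print("pairs (global attention):", pairs_global)
```

Because each window holds at most `window_size ** 3` voxels, the number of pairs per voxel is bounded by a constant, so the total grows linearly with the number of occupied voxels rather than quadratically.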

Paper Structure (65 sections, 7 equations, 8 figures, 13 tables)

Introduction
Related Work
Vision transformers
3D transformers for point cloud understanding
Pretrained 3D backbones
Architecture overview
Naive 3D extension of Swin Transformer
Memory complexity
Signal irregularity
Swin3D Architecture
Module design of Swin3D Transformer
Voxelization
Point cloud input
Voxelization
Initial feature embedding
...and 50 more sections

Figures (8)

Figure 1: A 2D illustration of sparse points in a $4\times 4$ window. (a) A fully occupied window. (b) A sparsely occupied window, with white cells being empty. (c) Regularly distributed sparse points in a window. (d) Sparse points irregularly distributed in a window; different circle colors indicate the varying point-wise signal, such as the RGB color. For simplicity, only one point is drawn in each non-empty cell.
Figure 2: Left: The architecture of Swin3D. It consists of five-stage transformer blocks that apply self-attention to sparse voxels within regular and shifted windows at various levels of a hierarchical sparse grid. The grids on the left are the illustration of sparse grids in 2D, with gray cells representing non-empty voxels. $N_i$ denotes the number of sparse voxels at the $i$-th level, and $C_i$ is the feature channel dimension. Right: The detailed operations of each module.
Figure 3: Evaluation of support for large models. Top-left: GPU memory consumption of wider networks. Top-right: GPU memory consumption of deeper networks. Bottom-left: GPU memory consumption with respect to the number of heads. Bottom-right: GPU memory consumption with respect to window size. GPU memory was measured over one forward-and-backward iteration and averaged over all ScanNet data.
Figure 4: Visual comparison of ScanNet segmentation. Left: Ground truth segmentation labels (points colored black are not labeled in the original dataset). Middle: Stratified Transformer's results. Right: Swin3D$_n$-L's results.
Figure 5: Visual comparison of ScanNet 3D detection. Left: Ground truth. Middle: FCAF3D's results. Right: Swin3D-L+CAGroup3D's results. Note that the original CAGroup3D paper does not release its checkpoint, so no visual results for it are provided in this figure.
...and 3 more figures
