Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding

Yu-Qi Yang, Yu-Xiao Guo, Yang Liu

TL;DR

Swin3D++ addresses the challenge of pretraining on multiple 3D indoor datasets whose domain characteristics differ. It introduces four domain-specific components (domain-specific initial feature embedding, domain-specific layer normalization, domain-specific voxel prompts, and domain-modulated cRSE/VM-cRSE) together with a source augmentation strategy, which mitigate domain discrepancy during multi-source pretraining. The approach, validated on Structured3D and ScanNet, achieves state-of-the-art results across 3D semantic segmentation, detection, and instance segmentation, and it supports data-efficient learning by fine-tuning only a small set of domain-specific parameters. This work demonstrates the practical value of structured multi-source pretraining for robust 3D indoor scene understanding and points to future expansion to outdoor multi-source data.

Abstract

Data diversity and abundance are essential for improving the performance and generalization of models in natural language processing and 2D vision. However, the 3D vision domain suffers from a lack of 3D data, and simply combining multiple 3D datasets to pretrain a 3D backbone does not yield significant improvement, because the domain discrepancies among different 3D datasets impede effective feature learning. In this work, we identify the main sources of the domain discrepancies between 3D indoor scene datasets and propose Swin3D++, an enhanced architecture based on Swin3D for efficient pretraining on multi-source 3D point clouds. Swin3D++ introduces domain-specific mechanisms to Swin3D's modules to address domain discrepancies and enhance the network's capability for multi-source pretraining. Moreover, we devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining. We validate the effectiveness of our design and demonstrate that Swin3D++ surpasses state-of-the-art 3D pretraining methods on typical indoor scene understanding tasks. Our code and models will be released at https://github.com/microsoft/Swin3D.
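
As one illustration of what a "domain-specific mechanism" could look like, the sketch below shows a layer normalization whose affine parameters are selected per source dataset. This is only a minimal NumPy sketch under stated assumptions: the text above names domain-specific layer normalization but does not give its formulation, so the class name `DomainSpecificLayerNorm`, the per-domain `gamma`/`beta` tables, and the integer `domain_id` argument are all hypothetical and are not taken from the Swin3D++ code.

```python
import numpy as np


class DomainSpecificLayerNorm:
    """Layer normalization with one (gamma, beta) pair per source domain.

    Hypothetical sketch: the normalization statistics are computed per
    feature vector as in standard layer normalization; only the affine
    parameters depend on which dataset the sample came from.
    """

    def __init__(self, num_channels, num_domains, eps=1e-5):
        self.eps = eps
        # One row of scale/shift parameters per domain (assumed layout).
        self.gamma = np.ones((num_domains, num_channels))
        self.beta = np.zeros((num_domains, num_channels))

    def __call__(self, features, domain_id):
        # features: (num_voxels, num_channels), all from the same domain.
        mean = features.mean(axis=-1, keepdims=True)
        var = features.var(axis=-1, keepdims=True)
        normalized = (features - mean) / np.sqrt(var + self.eps)
        # Apply the affine parameters that belong to this domain only.
        return normalized * self.gamma[domain_id] + self.beta[domain_id]


# Usage: domain 0 and domain 1 share the normalization but not the affine part.
layer_norm = DomainSpecificLayerNorm(num_channels=8, num_domains=2)
layer_norm.beta[1] = 1.0  # pretend domain 1 learned a different shift
voxel_features = np.random.default_rng(0).normal(size=(4, 8))
out_domain0 = layer_norm(voxel_features, domain_id=0)
out_domain1 = layer_norm(voxel_features, domain_id=1)
```

Under this assumption, fine-tuning on a new dataset would only need a new row of `gamma` and `beta`, which is in the spirit of the small set of domain-specific parameters mentioned in the TL;DR.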

Paper Structure (44 sections, 12 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Window sparsity and signal variation analysis on windows of size $5\times 5 \times 5$. The voxel size is set to 2cm. (a) The normalized cumulative histogram (NCH) of the ratio of occupied voxels in the window. (b), (c), and (d) represent the NCHs of the variances of the positions, colors, and normals of the points in the window, respectively. All statistical calculations are based on the average of 200 scenes for each dataset. The variance ranges are normalized to $[0, 1]$ in the figures.
  • Figure 2: Overview of the network architectures of Swin3D and Swin3D++. $N_1, N_2, N_3, N_4, N_5 = 2, 4, 9, 4, 4$. BN, LN, and FC refer to batch normalization, layer normalization, and fully-connected layer, respectively. Q, K, V are the Query, Key, and Value tensors of self-attention.
  • Figure 3: Segmentation accuracy of Swin3D on the Structured3D validation set. The x-axis represents the count of nonempty voxels in a $5\times 5 \times 5$ window with a voxel size of 2cm. The y-axis indicates the average segmentation accuracy across windows with an equal number of nonempty voxels.
  • Figure 4: Effect of the domain-specific voxel prompt on the segmentation accuracy on the Structured3D validation set.
  • Figure 5: Visual comparison of ScanNet segmentation. Left: Ground-truth segmentation labels. Middle: Results of Swin3D-L [yang2023swin3d]. Right: Results of Swin3D++.
  • ...and 1 more figures
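
The window sparsity statistic described in the Figure 1 caption (the ratio of occupied voxels per $5\times 5 \times 5$ window at a 2cm voxel size, summarized as a normalized cumulative histogram) can be sketched roughly as follows. This is a minimal NumPy sketch, not the authors' analysis script: the function names are hypothetical, and the use of non-overlapping windows, the restriction to non-empty windows, and the bin count are assumptions that the caption does not specify.

```python
import numpy as np


def window_occupancy_ratios(points, voxel_size=0.02, window=5):
    """Ratio of occupied voxels in each non-empty window^3 block.

    points: (N, 3) array of xyz coordinates in meters.
    Assumes non-overlapping windows aligned to the voxel grid origin.
    """
    # Quantize points to integer voxel coordinates and drop duplicates.
    voxels = np.unique(np.floor(points / voxel_size).astype(np.int64), axis=0)
    # Map each occupied voxel to the window that contains it.
    window_ids = np.floor_divide(voxels, window)
    # Count occupied voxels per window; empty windows never appear here.
    _, counts = np.unique(window_ids, axis=0, return_counts=True)
    return counts / float(window ** 3)


def normalized_cumulative_histogram(values, bins=10, value_range=(0.0, 1.0)):
    """Cumulative histogram scaled so that its last entry is 1."""
    hist, edges = np.histogram(values, bins=bins, range=value_range)
    nch = np.cumsum(hist) / max(hist.sum(), 1)
    # Return the right edge of each bin together with the cumulative share.
    return edges[1:], nch


# Usage on a synthetic scene: random points inside a 1m cube.
scene_points = np.random.default_rng(0).uniform(0.0, 1.0, size=(5000, 3))
occupancy = window_occupancy_ratios(scene_points)
bin_edges, nch = normalized_cumulative_histogram(occupancy)
```

Averaging `nch` over many scenes of one dataset would give one curve per dataset, which is how the caption describes panel (a). The same histogram helper could be applied to per-window variances of positions, colors, or normals for panels (b) to (d), after normalizing them to $[0, 1]$.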