ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Guangda Ji, Silvan Weder, Francis Engelmann, Marc Pollefeys, Hermann Blum

TL;DR

This paper tackles the data bottleneck in real-world 3D indoor scene understanding by introducing ARKit LabelMaker, the largest automatically labeled real-world 3D dataset with 186 semantic classes created from ARKitScenes using LabelMakerV2. It demonstrates that large-scale auto-labeled real-world data provides substantial gains for 3D semantic segmentation, improving both vanilla and transformer-based models on ScanNet and ScanNet200, and yielding notable tail-class improvements. The authors further enhance the labeling pipeline with Grounded-SAM integration and gravity alignment, while omitting the expensive NeuS lift to maintain scalability, and show that real-world data can match or exceed synthetic-data benefits, with promising transferability to downstream tasks and zero-shot settings. Collectively, the work provides evidence that scaling real-world auto-labeled 3D data can drive substantial performance gains, and offers a practical data-generation path via mobile integration for broad, scalable 3D perception research.

Abstract

Neural network performance scales with both model size and data volume, as shown in both language and image processing. This requires scaling-friendly architectures and large datasets. While transformers have been adapted for 3D vision, a 'GPT moment' remains elusive due to limited training data. We introduce ARKit LabelMaker, a large-scale real-world 3D dataset with dense semantic annotation that is more than three times larger than the previous largest dataset. Specifically, we extend ARKitScenes with automatically generated dense 3D labels using an extended LabelMaker pipeline, tailored for large-scale pre-training. Training on our dataset improves accuracy across architectures, achieving state-of-the-art 3D semantic segmentation scores on ScanNet and ScanNet200, with notable gains on tail classes. Our code is available at https://labelmaker.org and our dataset at https://huggingface.co/datasets/labelmaker/arkit_labelmaker.

Paper Structure

This paper contains 44 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Our LabelMaker annotation data creates the world's largest real-world 3D scene annotation dataset.
  • Figure 2: Dependency graph of the LabelMakerV2 pipeline. The pipeline has a clear dependency structure that must be respected when the data is processed in a distributed fashion, especially when recovering from job failures. Our recovery strategy checks the dependency graph for unfinished jobs before submitting any new ones, to avoid wasting compute resources. Boxes with a thick green frame denote visualizable tasks, which are used during inspection and job quality assurance.
  • Figure 3: Qualitative Evaluation of Gravity Alignment. LabelMaker annotation with and without gravity alignment. Without gravity alignment, floors may be misclassified as walls and walls as ceilings, and other orientation-dependent objects may be mislabeled as well.
  • Figure 4: Correctly predicted tail class points on the ScanNet200 validation set. We compare the number of correctly predicted points for selected tail classes on the ScanNet200 validation set between PTv3 trained from scratch and PTv3-PPT trained with our datasets. With our dataset, Point Transformer becomes better at detecting rare classes. Tail classes that are not predicted by any model are omitted from this plot; the full tail-class performance difference is presented in the supplementary.
  • Figure 5: Visualization on ARKitScenes. From left to right: 3D scene, ground truth annotation (black regions indicate unannotated areas), LabelMaker annotations, OpenScene predictions.
  • ...and 1 more figure
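
The recovery strategy described in the Figure 2 caption (check the dependency graph for unfinished jobs before submitting new ones) can be illustrated with a small sketch. This is not the authors' implementation: the function name `plan_resubmission`, the task names, and the rule that a finished task is rerun when something upstream of it must be rerun are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_resubmission(deps, finished):
    """Return the tasks to (re)submit, in dependency order.

    deps:     dict mapping each task to the set of tasks it depends on
    finished: set of tasks whose outputs already exist on disk

    Assumption: a task is rerun if it is unfinished, or if any of its
    upstream tasks has to be rerun. Finished tasks are never resubmitted
    otherwise, so no compute is spent on them.
    """
    to_run = []
    stale = set()
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(deps).static_order():
        upstream = deps.get(task, ())
        if task not in finished or any(u in stale for u in upstream):
            stale.add(task)
            to_run.append(task)
    return to_run


# Hypothetical task names, not the actual nodes of the LabelMakerV2 graph.
example_deps = {
    "model_a": {"preprocess"},
    "model_b": {"preprocess"},
    "consensus": {"model_a", "model_b"},
    "lift_3d": {"consensus"},
}

if __name__ == "__main__":
    # Suppose the job failed after "preprocess" and "model_a" completed.
    print(plan_resubmission(example_deps, {"preprocess", "model_a"}))
```

In this sketch only the unfinished branch and everything downstream of it is resubmitted. A real scheduler would also need to detect partially written outputs before treating a task as finished.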