Leveraging Automatic CAD Annotations for Supervised Learning in 3D Scene Understanding

Yuchen Rao, Stefan Ainetter, Sinisa Stekovic, Vincent Lepetit, Friedrich Fraundorfer

TL;DR

This paper tackles the lack of high-quality 3D annotations for indoor scenes by applying an automatic CAD annotation pipeline, which extends HOC-Search, to ScanNet++ v1. The result is SCANnotate++, which provides CAD models and 9D poses for over 5k objects in 280 scans. The authors demonstrate that supervised models for point cloud completion and single-view CAD model retrieval/alignment trained on these automatic annotations improve over baselines trained on manually annotated data, and that additional automatic data further boosts performance. They also show that the learned models generalize to ScanNet++ and that pretraining on automatic annotations enhances results. The authors will release SCANnotate++ and the trained models to spur further research in 3D scene understanding and annotation-efficient learning.

Abstract

High-level 3D scene understanding is essential in many applications. However, the challenges of generating accurate 3D annotations make development of deep learning models difficult. We turn to recent advancements in automatic retrieval of synthetic CAD models, and show that data generated by such methods can be used as high-quality ground truth for training supervised deep learning models. More exactly, we employ a pipeline akin to the one previously used to automatically annotate objects in ScanNet scenes with their 9D poses and CAD models. This time, we apply it to the recent ScanNet++ v1 dataset, which previously lacked such annotations. Our findings demonstrate that it is not only possible to train deep learning models on these automatically-obtained annotations but that the resulting models outperform those trained on manually annotated data. We validate this on two distinct tasks: point cloud completion and single-view CAD model retrieval and alignment. Our results underscore the potential of automatic 3D annotations to enhance model performance while significantly reducing annotation costs. To support future research in 3D scene understanding, we will release our annotations, which we call SCANnotate++, along with our trained models.

Paper Structure

This paper contains 25 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We provide high-quality shape and pose annotations for objects in RGB-D scans for the ScanNet++ dataset. Because our pipeline is automatic, we can annotate almost all objects in the scene even when they are only partially visible, and the annotations are of consistently high quality. We demonstrate that we can use these annotations for supervised learning, including point cloud completion (Task 1) and CAD model retrieval and alignment (Task 2), and that the model performance can be increased either by using the annotations as additional training data, or by using them to fine-tune a pre-trained model for a previously unseen dataset.
  • Figure 2: Our automatic annotation pipeline. For a given RGB-D scan, we first use the provided 3D object instance segmentation to estimate the pose of each object, visualized as red bounding boxes. The scan, 3D segmentation, and initial estimated poses are then used as input for our annotation method. As the final result, we provide high-quality 3D shape annotations in the form of CAD models retrieved from a large shape database, and corresponding 9D pose annotations for all target objects.
  • Figure 3: Class histogram of the SCANnotate++ dataset. The histogram shows the typical long-tail distribution of objects in indoor scenes, with common classes like chair, cabinet, and table making up the majority of the annotated objects.
  • Figure 4: Examples of our SCANnotate++ annotations. Our annotations accurately capture the 3D geometry of target objects, and their re-projections in the images align accurately with the objects.
  • Figure 5: Overview of our point cloud completion pipeline. First, we train an auto-encoder based on ShapeGF (Cai et al., 2020) using complete point clouds $x_{c}$. Next, we introduce a separate encoder $g_{enc}$ for partial input point clouds $x_{p}$ from ScanNet (Dai et al., 2017) and ScanNet++ (Yeshwanth et al., 2023), and we train it to minimize the difference between latent representations of the ground truth point cloud $z_{c}$ and partial point cloud $z_{p}$. The decoder $f_{dec}$ from the pretrained auto-encoder reconstructs the complete point cloud $x_{\tilde{p}}$ from latent $z_{p}$.
  • ...and 2 more figures
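
The latent-matching step described in the Figure 5 caption can be illustrated with a toy sketch. This is not the authors' implementation: the real pipeline uses a ShapeGF-based auto-encoder and point cloud encoders, whereas here both encoders are hypothetical linear maps on fixed-length vectors, and the squared-error distance is an assumption (the caption only says the "difference" between $z_{c}$ and $z_{p}$ is minimized). The names `encode_linear`, `latent_loss`, and `train_partial_encoder` are made up for illustration.

```python
def encode_linear(weights, x):
    """Hypothetical stand-in encoder: z = W x, one latent value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def latent_loss(z_p, z_c):
    """Assumed distance: mean squared difference between the two latents."""
    return sum((a - b) ** 2 for a, b in zip(z_p, z_c)) / len(z_c)


def train_partial_encoder(pairs, w_frozen, w_init, steps=200, lr=0.05):
    """Fit the partial-input encoder g_enc so that g_enc(x_p) approaches z_c.

    pairs    : list of (x_p, x_c) vectors, partial and complete inputs
    w_frozen : weights of the pretrained encoder; never updated here
    w_init   : starting weights for g_enc; copied, not modified in place
    """
    w_g = [row[:] for row in w_init]
    latent_dim = len(w_g)
    for _ in range(steps):
        for x_p, x_c in pairs:
            z_c = encode_linear(w_frozen, x_c)  # target latent, treated as constant
            z_p = encode_linear(w_g, x_p)
            for i in range(latent_dim):
                # Gradient of the mean squared error with respect to row i of w_g.
                grad_scale = 2.0 * (z_p[i] - z_c[i]) / latent_dim
                for j, x_j in enumerate(x_p):
                    w_g[i][j] -= lr * grad_scale * x_j
    return w_g


# Toy usage: each "partial" vector is a complete vector with entries zeroed out.
w_frozen = [[0.5, -0.2, 0.1, 0.3], [0.0, 0.4, -0.3, 0.2]]
pairs = [
    ([1.0, 0.0, 0.5, 0.0], [1.0, 0.8, 0.5, 0.2]),
    ([0.0, 1.0, 0.0, 0.7], [0.3, 1.0, 0.6, 0.7]),
]
w_zero = [[0.0] * 4 for _ in range(2)]
w_g = train_partial_encoder(pairs, w_frozen, w_zero)
```

Only the partial-input encoder is updated; the pretrained encoder supplies fixed targets, which mirrors the caption's point that the pretrained decoder $f_{dec}$ can then be reused unchanged on $z_{p}$.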