SCRREAM: SCan, Register, REnder And Map: A Framework for Annotating Accurate and Dense 3D Indoor Scenes with a Benchmark

HyunJun Jung, Weihang Li, Shun-Cheng Wu, William Bittner, Nikolas Brasch, Jifei Song, Eduardo Pérez-Pellitero, Zhensong Zhang, Arthur Moreau, Nassir Navab, Benjamin Busam

TL;DR

Indoor 3D datasets traditionally sacrifice ground-truth accuracy for scale, hindering dense geometry evaluation. SCRREAM introduces a Scan, Register, Render, and Map pipeline that scans objects at high resolution, registers them into real rooms, renders realistic scenes, and maps real video frames to obtain precise camera poses and dense ground-truth depth. The approach yields publicly available data and benchmarks across four tasks (indoor reconstruction/SLAM, object removal, human reconstruction, and 6D pose estimation), enabling rigorous geometric evaluation with accurate depth ground truth. This framework lowers the barrier to objective, high-fidelity benchmarking of dense 3D methods and closes the gap between synthetic accuracy and real-world scenes, facilitating advances in novel view synthesis (NVS) and SLAM research.

Abstract

Traditionally, 3D indoor datasets have prioritized scale over ground-truth accuracy in order to improve generalization. However, using these datasets to evaluate dense geometry tasks, such as depth rendering, can be problematic because their meshes are often incomplete and may produce incorrect ground truth for evaluating fine details. In this paper, we propose SCRREAM, a dataset annotation framework that allows annotation of fully dense meshes of objects in the scene and registers camera poses on the real image sequence, which can produce accurate ground truth for both sparse and dense 3D tasks. We show the details of the dataset annotation pipeline and showcase four possible variants of datasets that can be obtained from our framework with example scenes: indoor reconstruction and SLAM, scene editing and object removal, human reconstruction, and 6D pose estimation. Recent pipelines for indoor reconstruction and SLAM serve as new benchmarks. In contrast to previous indoor datasets, our design allows evaluating dense geometry tasks on eleven sample scenes against accurately rendered ground-truth depth maps.
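
To illustrate what evaluating against a rendered ground-truth depth map can look like, the following is a minimal sketch in Python. The function name `depth_errors`, the `min_depth` threshold, and the choice of absolute relative error and RMSE are assumptions made for illustration; this section does not state which metrics the benchmark reports.

```python
import numpy as np

def depth_errors(pred, gt, min_depth=1e-3):
    """Compare a predicted depth map with a ground-truth depth map.

    Hypothetical helper: pixels whose ground truth is missing
    (non-finite or <= min_depth) are excluded from the statistics.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")

    # Keep only pixels with usable ground truth and a finite prediction.
    valid = np.isfinite(gt) & (gt > min_depth) & np.isfinite(pred)
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")

    diff = pred[valid] - gt[valid]
    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt[valid])),  # mean |d - d*| / d*
        "rmse": float(np.sqrt(np.mean(diff ** 2))),           # root mean squared error
        "valid_ratio": float(valid.mean()),                   # fraction of pixels scored
    }
```

The `valid_ratio` entry is included because an incomplete mesh yields holes in the rendered depth, so the fraction of pixels that can be scored is itself informative when comparing ground-truth sources.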

Paper Structure

This paper contains 6 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Due to their typical acquisition pipelines, traditional indoor 3D datasets can provide incomplete meshes for their scenes, with missing structures and holes. Our dataset annotation pipeline, in contrast, starts by scanning individual objects at high resolution and then registers them to the real scene and the real camera sequence, allowing highly detailed ground-truth rendering for dense 3D vision tasks.
  • Figure 2: Dataset Geometry Comparison. Our dataset features high-quality, complete meshes of the scene. In comparison, commonly used datasets such as Replica [replica19arxiv] and ScanNet++ [yeshwanth2023scannet++] suffer from over-smoothed or incomplete meshes (zoom in for details). We include more in-depth comparisons in the supplementary material PDF and video file.
  • Figure 3: Framework Pipeline Overview. Our pipeline follows the SCRREAM scheme for annotation. (a) SCan: scan the individual objects in the scene. (b, c) Register: place the objects in the scene, scan the scene partially, and register the pre-scanned meshes. (d) REnder: render the synthetic images. And (e-g) Map: map 3D features of the synthetic images with their camera poses, record the real image sequence, and obtain the camera poses via feature matching. Once the camera poses are obtained, we can extract or render the 3D information by transforming the meshes into the camera frame, as shown in (h).
  • Figure 4: Example for the Scan and Register Step. (a) We pre-scan all meshes in the scene before setting up the scene. This ensures that all objects and furniture are scanned as high-quality, watertight meshes. (b) Then we place the furniture in the room to set up the scene and (c) scan the entire scene (note that the scene is not scanned completely), such that we can (d) register all pre-scanned meshes to the scene layout via manual correspondence selection followed by ICP.
  • Figure 5: Example for the Mapping Step and Qualitative Evaluation. Mapping starts with (a) generating realistic synthetic renderings from the registered mesh. Once the images are rendered, features are matched and de-projected to 3D. Then (b) the real sequence is acquired, and features are extracted to match with synthetic features to obtain the camera poses. These camera poses allow us to transform the camera frame to the mesh frame such that dense 3D annotations (c) can be rendered into the image as ground truth. We show rendered instance masks as well as depth on the real image frame to illustrate the quality of our ground truth.
  • ...and 5 more figures
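
Two geometric steps named in the captions above can be sketched in a few lines of NumPy. The first function estimates a rigid transform from manually selected point correspondences, which is the kind of initial alignment that typically precedes ICP refinement (Figure 4d); it uses the standard Kabsch/SVD solution and does not run ICP itself. The second transforms 3D points into the camera frame with a known pose and keeps the nearest depth per pixel (Figure 3h). This is a simplified point-splatting stand-in, not a mesh rasterizer, and the function names, the world-to-camera pose convention, and the pinhole intrinsics matrix `K` are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def rigid_from_correspondences(src, dst):
    """Least-squares R, t such that dst ~= R @ src + t (Kabsch algorithm).

    src, dst: (N, 3) arrays of matched 3D points, N >= 3, not collinear.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)

    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def render_point_depth(points_world, R_wc, t_wc, K, height, width):
    """Z-buffered depth map from 3D points.

    R_wc, t_wc map world coordinates to camera coordinates (assumed
    convention); K is a 3x3 pinhole intrinsics matrix with last row
    [0, 0, 1]. Pixels that receive no point stay at +inf.
    """
    pts_cam = np.asarray(points_world, dtype=np.float64) @ R_wc.T + t_wc
    z = pts_cam[:, 2]
    front = z > 0                      # discard points behind the camera
    pts_cam, z = pts_cam[front], z[front]

    uvw = pts_cam @ K.T                # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    depth = np.full((height, width), np.inf)
    # Keep the smallest depth that lands on each pixel.
    np.minimum.at(depth, (v[inside], u[inside]), z[inside])
    return depth
```

In practice the point set passed to `render_point_depth` would be sampled densely from the registered meshes, and a real renderer would rasterize triangles so that surfaces have no gaps between samples.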