MASSTAR: A Multi-Modal and Large-Scale Scene Dataset with a Versatile Toolchain for Surface Prediction and Completion

Guiyong Zheng, Jinqi Jiang, Chen Feng, Shaojie Shen, Boyu Zhou

TL;DR

The paper addresses the challenge of scalable, multi-modal surface prediction and completion for large-scale scenes by introducing MASSTAR, a versatile toolchain that converts raw 3D data into a large-scale, multi-modal scene dataset. MASSTAR yields over 1000 scene-level 3D meshes with accompanying images, descriptive texts, and partial point clouds, enabling realistic benchmarking beyond object-level datasets. Benchmarking with SPM, PCN, and XMFNet shows that existing surface completion methods struggle with scene-level data, highlighting the value of multi-modal context and scale. The authors provide an open-source toolchain and dataset to spur research in robotic perception and embodied AI, with potential impact on real-world scene understanding and reconstruction tasks. Evaluation metrics include Chamfer Distance (CD), $L_1$-CD, $L_2$-CD, precision, recall, F-score, and AUC.
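The metrics named above can be illustrated with a minimal sketch. The exact definitions used in the paper are not given in this summary, so the conventions below are assumptions: $L_1$-CD is taken as the symmetric mean of nearest-neighbour distances, $L_2$-CD as the sum of the two mean squared nearest-neighbour distances, and precision/recall as the fraction of points within a hypothetical distance threshold `tau`. The function names are illustrative and do not come from the MASSTAR toolchain. The brute-force search is only practical for small point sets.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def nearest_distances(src: List[Point], dst: List[Point]) -> List[float]:
    """For each point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_l1(pred: List[Point], gt: List[Point]) -> float:
    """Assumed L1-CD: average of the two mean nearest-neighbour distances."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    return 0.5 * (sum(d_pred) / len(d_pred) + sum(d_gt) / len(d_gt))


def chamfer_l2(pred: List[Point], gt: List[Point]) -> float:
    """Assumed L2-CD: sum of the two mean squared nearest-neighbour distances."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    mean_sq_pred = sum(d * d for d in d_pred) / len(d_pred)
    mean_sq_gt = sum(d * d for d in d_gt) / len(d_gt)
    return mean_sq_pred + mean_sq_gt


def precision_recall_fscore(
    pred: List[Point], gt: List[Point], tau: float
) -> Tuple[float, float, float]:
    """Precision: share of predicted points within tau of the ground truth.
    Recall: share of ground-truth points within tau of the prediction.
    F-score: harmonic mean of the two (0.0 if both are zero)."""
    d_pred = nearest_distances(pred, gt)
    d_gt = nearest_distances(gt, pred)
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    if precision + recall == 0:
        return precision, recall, 0.0
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore
```

Under these conventions, a completed surface that covers only part of the scene scores high precision but low recall, which is why reporting both alongside CD gives a fuller picture than CD alone.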

Abstract

Surface prediction and completion have been widely studied in various applications. Recently, research in surface completion has evolved from small objects to complex large-scale scenes. As a result, researchers have begun increasing the volume of data and leveraging a greater variety of data modalities, including rendered RGB images, descriptive texts, depth images, etc., to enhance algorithm performance. However, existing datasets suffer from a deficiency in the number of scene-level models along with the corresponding multi-modal information. Therefore, a method to scale the datasets and generate multi-modal information in them efficiently is essential. To bridge this research gap, we propose MASSTAR: a Multi-modal lArge-scale Scene dataset with a verSatile Toolchain for surfAce pRediction and completion. We develop a versatile and efficient toolchain for processing the raw 3D data from the environments. It screens out a set of fine-grained scene models and generates the corresponding multi-modal data. Utilizing the toolchain, we then generate an example dataset composed of over a thousand scene-level models with partial real-world data added. We compare MASSTAR with the existing datasets, which validates its superiority: the ability to efficiently extract high-quality models from complex scenarios to expand the dataset. Additionally, several representative surface completion algorithms are benchmarked on MASSTAR, which reveals that existing algorithms can hardly deal with scene-level completion. We will release the source code of our toolchain and the dataset. For more details, please see our project page at https://sysu-star.github.io/MASSTAR.

Paper Structure (16 sections, 5 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: We propose a multi-modal dataset composed of plenty of large-scale scene data for 3D surface prediction and completion, as well as a versatile and efficient toolchain to create such a dataset from raw 3D data from environments.
  • Figure 2: The structure of our dataset. We release a lightweight format dataset that contains only 3D mesh models, and users can generate a complete format dataset using our toolchain.
  • Figure 3: A comparison of 3D scene models in MASSTAR with those in other datasets. While the models in the former datasets suffer from different drawbacks, the 3D scene models in MASSTAR feature complete surfaces, well-segmented scenes, object-level models, and high quality.
  • Figure 4: An overview of 3D scene segmentation. Initially, we generate the depth image and RGB image by rendering a bird's-eye view of each scene. Users have the option to employ SAM (Kirillov et al., 2023) for segmenting top-view images in manual mode or automatic mode. Subsequently, the 3D mesh model is sliced using Blender, and then CLIP (Radford et al., 2021) is utilized to filter out non-architectural categories.
  • Figure 5: An example of the image rendering part of the toolchain. We offer the random mode (left) and trajectory mode (right) for users.
  • ...and 3 more figures