ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation

Shengchao Zhou, Jiehong Lin, Jiahui Liu, Shizhen Zhao, Chirui Chang, Xiaojuan Qi

TL;DR

The paper tackles the data scarcity barrier in class-agnostic 3D instance segmentation by introducing ASSIST-3D, a purpose-built pipeline that synthesizes richly annotated 3D scenes. It combines heterogeneous object selection from large CAD asset collections, GPT-4-guided scene layout with depth-first placement, and realistic point cloud construction from multi-view RGB-D renderings to narrow the gap between synthetic and real data. Training a strong baseline (Mask3D) on ScanNetV2 augmented with ASSIST-3D data achieves state-of-the-art performance on ScanNet++, S3DIS, and in-domain ScanNetV2, and extensive ablations validate the importance of geometry diversity, context complexity, and realistic sensing. The results demonstrate that carefully designed synthetic data can substantially improve generalization to unseen object categories in 3D scenes and offer a scalable path for future class-agnostic segmentation research.

Abstract

Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without relying on semantic classes. Current methods struggle to generalize because of scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed ASSIST-3D, to synthesize data suited to improving model generalization. Specifically, ASSIST-3D features three key innovations: 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
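
As an illustration of the third component (multi-view RGB-D rendering and fusion), the sketch below back-projects depth images through a pinhole camera model and merges the views with a simple voxel filter. It is not the authors' implementation: the function names `backproject_depth` and `fuse_views`, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the row-major 4x4 `cam_to_world` pose, and the 2 cm default voxel size are all assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_depth(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift a depth image (metres, 0 = no return) to world-space points.

    Assumes a pinhole camera and a row-major 4x4 camera-to-world matrix.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth reading
            # Pinhole back-projection into the camera frame.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Rigid transform into the world frame (top three rows only).
            world = tuple(
                m[0] * x + m[1] * y + m[2] * z + m[3]
                for m in cam_to_world[:3]
            )
            points.append(world)
    return points


def fuse_views(views: Sequence[Sequence[Point]], voxel: float = 0.02) -> List[Point]:
    """Merge per-view points, keeping the first point seen in each voxel."""
    kept: Dict[Tuple[int, int, int], Point] = {}
    for pts in views:
        for p in pts:
            key = tuple(int(math.floor(c / voxel)) for c in p)
            kept.setdefault(key, p)
    return list(kept.values())
```

Because each view only contributes points that are visible from its camera, a cloud built this way inherits the occlusions and uneven sampling of a real scan, which is the property the abstract attributes to this step. Per-point instance labels could be carried alongside the coordinates in the same loop.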

Paper Structure

This paper contains 19 sections, 2 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: (a) Methods adapted from conventional class-aware approaches rely heavily on limited real-world datasets and face severe data scarcity. (b) Methods that use 2D foundation models for multi-view image segmentation (representing 3D scenes) struggle with 2D segmentation errors or cross-view inconsistency issues. (c) Our proposed ASSIST-3D addresses these limitations by generating high-quality, fully-annotated synthetic 3D scenes to improve model generalization.
  • Figure 2: An overview of our proposed ASSIST-3D. First, ASSIST-3D selects heterogeneous objects from a 3D asset base such as Objaverse, satisfying geometry diversity and context complexity. Next, it leverages GPT-4 to design the arrangement of these objects within the scene, ensuring layout reasonability. Finally, it constructs realistic point clouds by mimicking the construction procedure of real datasets to reduce the domain gap and enhance performance.
  • Figure 3: Performance curves of average precision on ScanNet++ and S3DIS with varying numbers of object classes used for synthesis and different amounts of synthetic training data.
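
The layout step in Figure 2 pairs GPT-4's spatial reasoning with a depth-first search over object placements. The sketch below shows only the search half, under simplifying assumptions that do not come from the paper: objects are axis-aligned rectangular footprints on the floor plane, and each object arrives with a precomputed list of candidate positions (standing in for whatever constraints the LLM would supply). The names `overlaps` and `place_objects` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # x, y, width, depth


def overlaps(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned footprints share interior area."""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return ax < bx + bw and bx < ax + aw and ay < by + bd and by < ay + ad


def place_objects(
    sizes: List[Tuple[str, float, float]],
    candidates: Dict[str, List[Tuple[float, float]]],
    room: Tuple[float, float],
) -> Optional[Dict[str, Rect]]:
    """Depth-first search with backtracking over candidate positions.

    sizes: (unique name, width, depth) per object, in placement order.
    candidates: assumed list of (x, y) corner positions to try per object.
    room: (width, depth) of the floor. Returns None if no layout fits.
    """
    placed: Dict[str, Rect] = {}

    def dfs(i: int) -> bool:
        if i == len(sizes):
            return True  # every object has a collision-free position
        name, w, d = sizes[i]
        for x, y in candidates[name]:
            rect = (x, y, w, d)
            inside = x >= 0 and y >= 0 and x + w <= room[0] and y + d <= room[1]
            if inside and not any(overlaps(rect, r) for r in placed.values()):
                placed[name] = rect
                if dfs(i + 1):
                    return True
                del placed[name]  # backtrack and try the next candidate
        return False

    return placed if dfs(0) else None
```

When a later object cannot be placed, the search undoes the most recent choice and tries that object's next candidate, so an early placement that blocks the rest of the scene gets revised rather than causing the whole layout to fail.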