ASSIST-3D: Adapted Scene Synthesis for Class-Agnostic 3D Instance Segmentation
Shengchao Zhou, Jiehong Lin, Jiahui Liu, Shizhen Zhao, Chirui Chang, Xiaojuan Qi
TL;DR
The paper tackles data scarcity in class-agnostic 3D instance segmentation by introducing ASSIST-3D, a purpose-built pipeline that synthesizes richly annotated 3D scenes. It combines heterogeneous object selection from large CAD asset collections, GPT-4-guided scene layout with depth-first placement, and realistic point cloud construction from multi-view RGB-D renderings to narrow the gap between synthetic and real data. Training a strong baseline (Mask3D) on ScanNetV2 augmented with ASSIST-3D data achieves state-of-the-art performance on ScanNet++ and S3DIS as well as on in-domain ScanNetV2, and extensive ablations validate the importance of geometry diversity, context complexity, and realistic sensing. The results demonstrate that carefully designed synthetic data can substantially improve generalization to unseen object categories in 3D scenes and offers a scalable path for future class-agnostic segmentation research.
Abstract
Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without relying on semantic classes. Current methods generalize poorly because they depend on either scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometry diversity, context complexity, and layout reasonability, each of which is essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed ASSIST-3D, which synthesizes data suited to improving model generalization. Specifically, ASSIST-3D features three key innovations: 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for reasonable object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on the ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
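To make the depth-first placement idea in the second component more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a rectangular floor plan, axis-aligned rectangular object footprints, and a fixed grid of candidate positions, and it checks only that footprints do not overlap. In ASSIST-3D the placement constraints come from LLM-guided spatial reasoning, which this sketch does not model. All names here (`place_dfs`, `candidate_positions`, `overlaps`) are hypothetical.

```python
from typing import List, Optional, Tuple

# Axis-aligned footprint on the floor: (x, y, width, depth). Hypothetical type.
Rect = Tuple[float, float, float, float]


def overlaps(a: Rect, b: Rect) -> bool:
    """Return True if two footprints intersect with non-zero area."""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return ax < bx + bw and bx < ax + aw and ay < by + bd and by < ay + ad


def candidate_positions(
    room: Tuple[float, float], size: Tuple[float, float], step: float
) -> List[Tuple[float, float]]:
    """Enumerate grid positions where a footprint of `size` fits inside `room`."""
    room_w, room_d = room
    w, d = size
    positions = []
    y = 0.0
    while y + d <= room_d + 1e-9:
        x = 0.0
        while x + w <= room_w + 1e-9:
            positions.append((x, y))
            x += step
        y += step
    return positions


def place_dfs(
    room: Tuple[float, float],
    sizes: List[Tuple[float, float]],
    step: float = 0.5,
    placed: Optional[List[Rect]] = None,
) -> Optional[List[Rect]]:
    """Place objects one at a time by depth-first search with backtracking.

    Returns one footprint per object, or None if no collision-free
    arrangement exists on the candidate grid.
    """
    placed = [] if placed is None else placed
    if len(placed) == len(sizes):
        return list(placed)  # every object has a valid position
    w, d = sizes[len(placed)]
    for x, y in candidate_positions(room, (w, d), step):
        rect = (x, y, w, d)
        if any(overlaps(rect, other) for other in placed):
            continue  # violates the non-overlap constraint
        placed.append(rect)
        result = place_dfs(room, sizes, step, placed)
        if result is not None:
            return result
        placed.pop()  # backtrack and try the next candidate
    return None
```

For example, `place_dfs((2.0, 1.0), [(1.0, 1.0), (1.0, 1.0)])` places two unit footprints side by side, while the same objects in a 1.5 by 1.0 room return `None` because the search exhausts every candidate. A fuller version would add further constraint checks (for instance wall adjacency or relations between objects) next to the overlap test.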
