Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation
Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy
TL;DR
Open-vocabulary 3D scene understanding is limited by data scarcity. The authors propose an automatic pipeline that generates 3D mask-text pairs at scale, combining Grounded-SAM/SEEM for precise masks with Osprey for region captions. Applied to multiple 3D scene datasets, the pipeline produces Mosaic3D-5.6M, a dataset of 5.6M mask-text pairs across more than 30K scenes. On this data they train Mosaic3D, which consists of a language-aligned 3D encoder trained with a contrastive objective $\mathcal{L}_{point}$ that aligns per-point features with text embeddings, followed by a lightweight mask decoder trained with the losses $\mathcal{L}_{obj}$, $\mathcal{L}_{dice}$, $\mathcal{L}_{bce}$, and $\mathcal{L}_{cap}$ for open-vocabulary segmentation. The resulting model achieves state-of-the-art results on ScanNet200, Matterport3D, and ScanNet++, and ablations show that dataset scale and caption richness are crucial. By leveraging 2D vision-language foundation models to supervise 3D understanding, Mosaic3D enables scalable open-vocabulary 3D segmentation, with potential impact on robotics, AR/VR, and autonomous systems.
Abstract
We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.
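To make the contrastive training of the 3D encoder concrete, the following is a minimal sketch of one common way to align per-point features with caption embeddings: an InfoNCE-style softmax cross-entropy in which every point inside a mask treats that mask's caption as the positive and all other captions as negatives. This is an illustration under stated assumptions, not necessarily the exact loss used in the paper. The function name `point_text_contrastive_loss`, the `temperature` default, and the input layout are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def point_text_contrastive_loss(point_feats, text_embs, masks, temperature=0.07):
    """InfoNCE-style loss: each point in mask k should match caption k.

    point_feats: list of N per-point feature vectors (length D each)
    text_embs:   list of K caption embeddings (length D each)
    masks:       list of K lists of point indices; masks[k] pairs with text_embs[k]
    temperature: assumed softmax temperature (hypothetical default)
    """
    points = [normalize(p) for p in point_feats]
    texts = [normalize(t) for t in text_embs]
    total, count = 0.0, 0
    for k, indices in enumerate(masks):
        for i in indices:
            # Cosine similarity of point i to every caption, scaled by temperature.
            logits = [
                sum(a * b for a, b in zip(points[i], t)) / temperature
                for t in texts
            ]
            # Numerically stable log-sum-exp over all captions.
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            # Cross-entropy with caption k as the correct class.
            total += log_z - logits[k]
            count += 1
    return total / max(count, 1)


# Toy example: two points on a "chair" mask, one point on a "table" mask.
point_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = point_text_contrastive_loss(point_feats, text_embs, [[0, 1], [2]])
swapped = point_text_contrastive_loss(point_feats, text_embs, [[2], [0, 1]])
```

In this toy setup the loss is small when each mask is paired with its matching caption (`aligned`) and large when the pairing is swapped (`swapped`), which is the behavior a language-aligned encoder is trained to exploit.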
