Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation

Junha Lee, Chunghyun Park, Jaesung Choe, Yu-Chiang Frank Wang, Jan Kautz, Minsu Cho, Chris Choy

TL;DR

Open-vocabulary 3D scene understanding is limited by data scarcity. The authors introduce Mosaic3D-5.6M, a dataset of 3D mask-text pairs generated at scale by a pipeline that combines Grounded-SAM/SEEM for precise masks with Osprey for region captions, yielding 5.6M pairs across over 30K scenes. They then train Mosaic3D, a language-aligned 3D encoder, with a contrastive objective $\mathcal{L}_{point}$ that aligns per-point features with text embeddings, followed by a lightweight mask decoder trained with the losses $\mathcal{L}_{obj}$, $\mathcal{L}_{dice}$, $\mathcal{L}_{bce}$, and $\mathcal{L}_{cap}$ for open-vocabulary segmentation. The resulting model achieves state-of-the-art results on ScanNet200, Matterport3D, and ScanNet++, and ablations show that dataset scale and caption richness are crucial. By using 2D vision-language foundation models to supervise 3D understanding, Mosaic3D enables scalable open-vocabulary 3D segmentation, with potential impact on robotics, AR/VR, and autonomous systems.

Abstract

We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework. Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware Vision-Language Models, we develop an automatic pipeline that generates high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building upon this data, we propose Mosaic3D, a foundation model combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation tasks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data.
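The contrastive training of the 3D encoder mentioned above can be illustrated with a minimal sketch. It assumes a generic InfoNCE-style loss over cosine similarities, which is not necessarily the exact formulation used in the paper; the function name `point_text_contrastive_loss`, the `temperature` value, and the `caption_ids` argument (the index of the caption each point belongs to) are hypothetical choices made for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def point_text_contrastive_loss(point_feats, text_embs, caption_ids, temperature=0.07):
    """Mean InfoNCE-style loss between per-point features and caption embeddings.

    Assumption: each point is pulled toward the embedding of its own caption
    and pushed away from all other captions. This is a generic sketch, not
    the paper's exact objective.
    """
    texts = [normalize(t) for t in text_embs]
    total = 0.0
    for feat, target in zip(point_feats, caption_ids):
        f = normalize(feat)
        # Cosine similarity to every caption, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(f, t)) / temperature for t in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy of the softmax against the matching caption.
        total += log_denom - logits[target]
    return total / len(point_feats)
```

With this sketch, points whose features already point in the direction of their caption embedding receive a loss close to zero, while points assigned to the wrong caption receive a large loss.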

Paper Structure

This paper contains 30 sections, 5 equations, 11 figures, 14 tables, 1 algorithm.

Figures (11)

  • Figure 1: Mosaic3D-5.6M. Mosaic3D-5.6M is a large-scale dataset generated from a collection of existing datasets [dai2017scannet, baruch2021arkitscenes, yeshwanth2023scannet++, chang2017matterport3d, zheng2020structured3d]. It consists of 5.6M mask-text pairs, each pairing a fine-grained mask (black outline in the figure) with a detailed caption (text in the matching color). Using this large-scale dataset, we propose Mosaic3D, a foundation model for open-vocabulary 3D segmentation.
  • Figure 2: Dataset comparison. We compare datasets using three metrics: # Nouns (the total number of unique normalized nouns in captions; higher is better), Coverage (the percentage of 3D points with associated captions per scene; higher is better), and Entropy (the entropy of the GT instance ID distribution within masks; lower means more homogeneous masks, hence better). (a) Mosaic3D-5.6M uses precise masks (Entropy: 60.7) with region-aware VLMs for detailed descriptions (# Nouns: 29.9K). (b) OV3D [jiang2024open] produces simple attribute labels (# Nouns: 2.5K) that lack comprehensive visual descriptions. (c) RegionPLC [yang2024regionplc] uses coarse bounding boxes, yielding imprecise masks (Entropy: 81.0).
  • Figure 3: Mosaic3D-5.6M data engine. Our data generation process consists of three key steps: (a) We predict object segments for each RGB frame using state-of-the-art image segmentation models [sam, ravi2024sam, zou2024segment]. (b) We pass the images and predicted masks to a region-aware vision-language model [yuan2024osprey] to generate descriptive captions for each region. (c) We project the 2D segmentation masks onto 3D points using camera parameters to create (d) 3D mask-text pairs. This pipeline enables us to generate a large-scale dataset of 3D mask-text pairs.
  • Figure 4: Statistics of 3D mask-text datasets. We show the total number of scenes and the total number of tokens in the generated captions. Mosaic3D-5.6M significantly surpasses previous datasets in scale, combining multiple datasets to create the largest 3D mask-text dataset to date.
  • Figure 5: Mosaic3D model. The Mosaic3D model is a SparseUNet [mink] trained on our Mosaic3D-5.6M dataset to extract language-aligned features from 3D point clouds. A mask decoder with positional encodings (P.E.) is trained on top to enable instance segmentation.
  • ...and 6 more figures
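
Two ideas from the captions above lend themselves to a short sketch: step (c) of Figure 3, where a 2D mask is lifted onto 3D points using camera parameters, and the Entropy metric of Figure 2. The code below assumes a simple pinhole camera and an optional depth-map visibility check; the function names, the `depth_tol` threshold, and the use of natural-log Shannon entropy are illustrative assumptions, and the scaling behind the values reported in Figure 2 is not stated here.

```python
import math
from collections import Counter


def project_mask_to_points(points, world_to_cam, fx, fy, cx, cy, mask,
                           depth=None, depth_tol=0.05):
    """Return indices of 3D points that project inside a 2D mask.

    Assumptions: `points` holds (x, y, z) world coordinates, `world_to_cam`
    is a 4x4 row-major extrinsic matrix, and `mask` / `depth` are H x W
    nested lists. The depth check is a hypothetical occlusion filter.
    """
    height, width = len(mask), len(mask[0])
    selected = []
    for idx, (x, y, z) in enumerate(points):
        # Transform the point from world to camera coordinates.
        xc, yc, zc = (r[0] * x + r[1] * y + r[2] * z + r[3] for r in world_to_cam[:3])
        if zc <= 0:
            continue  # point is behind the camera
        # Pinhole projection to the nearest pixel.
        u = int(round(fx * xc / zc + cx))
        v = int(round(fy * yc / zc + cy))
        if not (0 <= u < width and 0 <= v < height):
            continue  # point falls outside the image
        if depth is not None and abs(depth[v][u] - zc) > depth_tol:
            continue  # point is likely occluded in this frame
        if mask[v][u]:
            selected.append(idx)
    return selected


def instance_entropy(instance_ids):
    """Shannon entropy (in nats) of the GT instance IDs covered by one mask."""
    total = len(instance_ids)
    counts = Counter(instance_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

A mask that covers points from a single ground-truth instance has zero entropy under this definition, while a coarse mask that mixes several instances scores higher, which matches the "lower is better" reading in Figure 2.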