SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia, Yixin Chen, Huangyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang

TL;DR

This work tackles the scarcity and complexity of 3D vision-language grounding by introducing SceneVerse, the first million-scale 3D VL dataset with 68K scenes and 2.5M scene-language pairs, generated through human annotations and scalable scene-graph-based generation. It then presents Grounded Pre-training for Scenes (GPS), a transformer-based framework that performs multi-level contrastive alignment across object-level, scene-level, and referral-object-level descriptions, augmented with a masked language modeling objective. GPS achieves state-of-the-art results on standard 3D VL grounding benchmarks and demonstrates strong zero-shot transfer, including gains in open-vocabulary 3D segmentation when pre-trained on SceneVerse. The paper further analyzes data scaling, the roles of synthetic versus real scenes, and the impact of each GPS module, offering actionable guidance for scaling 3D vision-language research and applications in embodied agents.

Abstract

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to diverse object configurations, rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments on challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.

Paper Structure (70 sections, 7 equations, 10 figures, 11 tables, 2 algorithms)

Figures (10)

  • Figure 1: Overview of SceneVerse. A million-scale 3D vision-language dataset that comprises over $68$K various 3D indoor scenes and $2.5$M aligned scene-language pairs in the form of scene caption, object caption, and object referral.
  • Figure 2: SceneVerse collection and statistics. Given a 3D scene (a), our automated pipeline (c) generates three types of description including scene caption, object caption and object referral. (b) SceneVerse data comparison and composition.
  • Figure 3: Overview of GPS model. We use contrastive alignment at three levels $\mathcal{L}_{\text{obj}}$, $\mathcal{L}_{\text{scene}}$, and $\mathcal{L}_{\text{ref}}$ and a masked language modeling objective $\mathcal{L}_{\text{MLM}}$ for model learning.
  • Figure 4: Model performance vs. data scale. Plots show that models consistently improve in both the pre-train and zero-shot transfer settings on ScanRefer and SceneVerse-val as the data scales up.
  • Figure A.1: Overview of the relationships in SceneVerse. The target object is colored in blue.
  • ...and 5 more figures
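
To make the objective named in the Figure 3 caption more concrete, the sketch below shows one common way a contrastive alignment term can be written: a symmetric InfoNCE loss over matched pairs of 3D features and text features, combined with the other terms by a weighted sum. This is an illustrative sketch only, not the paper's implementation. The function names (`info_nce`, `total_loss`), the temperature value, the equal loss weights, and the use of the same loss form at every level are all assumptions made here for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (the zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(scene_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (3D feature, text feature) pairs.

    Row i of scene_feats is assumed to match row i of text_feats; every other
    row in the batch acts as a negative. The temperature is a hypothetical value.
    """
    a = [l2_normalize(v) for v in scene_feats]
    b = [l2_normalize(v) for v in text_feats]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean negative log-softmax of the diagonal (matched) entries.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the 3D-to-text and text-to-3D directions.
    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


def total_loss(l_obj, l_scene, l_ref, l_mlm, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four objectives; equal weights are an assumption."""
    return sum(w * l for w, l in zip(weights, (l_obj, l_scene, l_ref, l_mlm)))
```

With toy inputs, a batch whose 3D and text features line up row by row yields a much smaller loss than the same batch with the text rows swapped, which is the behavior a contrastive alignment term is meant to reward.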