SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding

Baoxiong Jia; Yixin Chen; Huangyue Yu; Yan Wang; Xuesong Niu; Tengyu Liu; Qing Li; Siyuan Huang

TL;DR

This work tackles the data scarcity and scene complexity that limit 3D vision-language (3D-VL) grounding by introducing SceneVerse, the first million-scale 3D-VL dataset, with about 68K indoor scenes and 2.5M scene-language pairs produced from human annotations and a scalable scene-graph-based generation approach. It then presents Grounded Pre-training for Scenes (GPS), a transformer-based framework that performs multi-level contrastive alignment across object-level, scene-level, and referral-object-level descriptions, augmented with a masked language modeling objective. GPS achieves state-of-the-art results on existing 3D visual grounding benchmarks and shows strong zero-shot transfer, including gains in open-vocabulary 3D segmentation when pre-trained on SceneVerse. The paper further analyzes data scaling, the roles of synthetic versus real scenes, and the contribution of each GPS module, giving guidance for scaling 3D vision-language research and its applications in embodied agents.
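The multi-level contrastive alignment mentioned above can be illustrated with a generic symmetric InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the paper's exact formulation: the function names (`contrastive_loss`, `total_loss`), the temperature value, and the equal default weights are all hypothetical, and the features are plain Python lists standing in for encoder outputs.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax of one row of similarity scores."""
    top = max(row)
    log_z = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss over N matched (visual, text) pairs.

    Pair i is the positive for row i; every other item in the batch is
    treated as a negative. The temperature is an illustrative assumption.
    """
    vis = [_normalize(v) for v in visual_feats]
    txt = [_normalize(t) for t in text_feats]
    n = len(vis)
    # Cosine similarity matrix scaled by the temperature.
    sim = [
        [sum(a * b for a, b in zip(vis[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    sim_t = [list(col) for col in zip(*sim)]
    vis_to_txt = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n
    txt_to_vis = -sum(_log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (vis_to_txt + txt_to_vis)


def total_loss(l_obj, l_scene, l_ref, l_mlm, weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the three alignment losses and the MLM loss.

    Equal weights are an assumption made for illustration only.
    """
    return sum(w * l for w, l in zip(weights, (l_obj, l_scene, l_ref, l_mlm)))


if __name__ == "__main__":
    objects = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[1.0, 0.0], [0.0, 1.0]]
    print(contrastive_loss(objects, captions))
```

In this sketch the same `contrastive_loss` would be applied at each level (object, scene, referral) with the corresponding feature pairs, and matched pairs yield a lower loss than mismatched ones.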

Abstract

3D vision-language grounding, which focuses on aligning language with the 3D physical environment, stands as a cornerstone in the development of embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces several significant challenges: (i) the inherent complexity of 3D scenes due to the diverse object configurations, their rich attributes, and intricate relationships; (ii) the scarcity of paired 3D vision-language data to support grounded learning; and (iii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these three major challenges in 3D vision-language by examining the potential of systematically upscaling 3D vision-language learning in indoor environments. We introduce the first million-scale 3D vision-language dataset, SceneVerse, encompassing about 68K 3D indoor scenes and comprising 2.5M vision-language pairs derived from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D vision-language learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on all existing 3D visual grounding benchmarks. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments on challenging 3D vision-language tasks. Project website: https://scene-verse.github.io.

Paper Structure (70 sections, 7 equations, 10 figures, 11 tables, 2 algorithms)

Introduction
Related Work
Datasets for Grounded 3D Understanding
Vision-Language Learning
Scene Curation
Referral Annotation by Humans
3D Scene Graph Construction
Language Generation with LLMs
Object Captioning
Object Referral
Scene Captioning
Data Quality and Statistics
Data Quality
Statistics
Grounded Pre-training for Scenes
...and 55 more sections

Figures (10)

Figure 1: Overview of SceneVerse. A million-scale 3D vision-language dataset that comprises over $68$K various 3D indoor scenes and $2.5$M aligned scene-language pairs in the form of scene caption, object caption, and object referral.
Figure 2: SceneVerse collection and statistics. Given a 3D scene (a), our automated pipeline (c) generates three types of description including scene caption, object caption and object referral. (b) SceneVerse data comparison and composition.
Figure 3: Overview of GPS model. We use contrastive alignment at three levels $\mathcal{L}_{\text{obj}}$, $\mathcal{L}_{\text{scene}}$, and $\mathcal{L}_{\text{ref}}$ and a masked language modeling objective $\mathcal{L}_{\text{MLM}}$ for model learning.
Figure 4: Model performance vs. data scale. Plots show that models consistently improve in both the pre-train and zero-shot transfer settings on ScanRefer and SceneVerse-val as the data scales up.
Figure A.1: Overview of the relationships in SceneVerse. The target object is colored in blue.
...and 5 more figures
