3D Vision and Language Pretraining with Large-Scale Synthetic Data
Dejie Yang, Zhu Xu, Wentao Mo, Qingchao Chen, Siyuan Huang, Yang Liu
TL;DR
This work tackles data scarcity in 3D vision-language pretraining by introducing SynVL3D, a large-scale synthetic dataset with 10K indoor scenes and 1M descriptions, enabling rich 3D-text grounding. It then presents SynFormer3D, a unified Transformer-based encoder trained with self-supervised MLM/MOM/SSM plus fine-grained tasks for object relations (ORP), multi-level region-word alignment (MRWA), and view-aggregated region-word alignment (VRWA), along with synthetic-to-real domain adaptation during fine-tuning. Across 3D visual grounding, dense captioning, and QA benchmarks, the approach achieves state-of-the-art results, validating the effectiveness of synthetic data and the proposed auxiliary tasks in bridging vision and language in 3D. The method reduces real-data collection costs and advances embodied intelligence by improving cross-modal understanding of complex 3D scenes.
Abstract
3D Vision-Language Pre-training (3D-VLP) aims to provide a pre-trained model that bridges 3D scenes with natural language, an important capability for embodied intelligence. However, current 3D-VLP datasets are hindered by limited scene-level diversity and insufficient fine-grained annotations (only 1.2K scenes and 280K textual annotations in ScanScribe), primarily because collecting and annotating 3D scenes is labor-intensive. To overcome these obstacles, we construct SynVL3D, a comprehensive synthetic scene-text corpus with 10K indoor scenes and 1M descriptions at the object, view, and room levels. SynVL3D offers diverse scene data, rich textual descriptions, multi-grained 3D-text associations, and low collection cost. Using the rich annotations in SynVL3D, we pre-train a simple and unified Transformer to align 3D and language with multi-grained pre-training tasks. Moreover, we propose synthetic-to-real domain adaptation during downstream fine-tuning to address the domain shift. Through extensive experiments, we verify the effectiveness of our model design by achieving state-of-the-art performance on downstream tasks including visual grounding, dense captioning, and question answering.
