3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

Ziyu Zhu, Xiaojian Ma, Yixin Chen, Zhidong Deng, Siyuan Huang, Qing Li

TL;DR

3D-VisTA is a simple Transformer-based framework for aligning 3D vision with natural language. It drops task-specific modules in favor of self-attention and spatial-relation-aware fusion. The model is pre-trained on ScanScribe, a large-scale dataset of 3D scene-text pairs, with three self-supervised objectives: masked language modeling (MLM), masked object modeling (MOM), and scene-text matching (STM). Fine-tuned on visual grounding, dense captioning, question answering, and situated reasoning, it achieves state-of-the-art results and remains strong when annotations are limited. The work highlights the potential of unified, foundation-model-like approaches in 3D-VL and names scaling of data and joint optimization of auxiliary components as future directions.

Abstract

3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and dense captioning to question answering and situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning.
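
To make the three pre-training objectives named in the abstract more concrete, here is a minimal Python sketch of how one training example might be prepared. It is an illustration under stated assumptions, not the authors' implementation: the function names (`mask_items`, `make_example`), the `[MASK]` placeholder, and the ratios `text_ratio`, `object_ratio`, and `negative_prob` are all hypothetical, and object class labels stand in for the 3D object features a real model would mask.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked token or object


def mask_items(items, ratio, rng):
    """Mask a random subset of items; return (masked list, {index: original})."""
    if not items:
        return [], {}
    # Always mask at least one item so every example carries a target.
    n_mask = max(1, round(len(items) * ratio))
    picked = set(rng.sample(range(len(items)), n_mask))
    masked = [MASK if i in picked else item for i, item in enumerate(items)]
    targets = {i: items[i] for i in picked}
    return masked, targets


def make_example(text, objects, other_texts, rng,
                 text_ratio=0.15, object_ratio=0.10, negative_prob=0.3):
    """Build one pre-training example. All ratios are illustrative assumptions."""
    # Scene-text matching: occasionally pair the scene with unrelated text.
    matched = True
    if other_texts and rng.random() < negative_prob:
        text = rng.choice(other_texts)
        matched = False
    # Masked language modeling and masked object modeling targets.
    masked_text, text_targets = mask_items(text, text_ratio, rng)
    masked_objects, object_targets = mask_items(objects, object_ratio, rng)
    return {
        "masked_text": masked_text,
        "text_targets": text_targets,
        "masked_objects": masked_objects,
        "object_targets": object_targets,
        "matched": matched,
    }


rng = random.Random(0)
example = make_example(
    text="the chair is next to the wooden table".split(),
    objects=["chair", "table", "lamp", "sofa"],
    other_texts=["a bed stands under the window".split()],
    rng=rng,
)
print(example["masked_text"], example["matched"])
```

In this sketch a model would be trained to recover `text_targets` and `object_targets` from the masked inputs and to predict `matched` from the fused scene-text representation. How the three losses are weighted and combined is left out, since the abstract does not specify it.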

Paper Structure

This paper contains 21 sections, 7 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Overall framework of our 3D-VisTA pipeline. We collect diverse prompts, scene graphs, 3D scans, and objects to construct the ScanScribe dataset. Through self-supervised pre-training, 3D-VisTA supports various downstream tasks including 3D visual grounding, dense captioning, question answering, and situated reasoning.
  • Figure 2: The model architecture of our 3D-VisTA, which includes text encoding, scene encoding, and multi-modal fusion modules. 3D-VisTA is pre-trained by self-supervised learning objectives, which include masked language modeling, masked object modeling, and scene-text matching. Pre-trained 3D-VisTA can be easily adapted to various downstream tasks by adding lightweight task heads without task-specific design like auxiliary losses and optimization tricks.
  • Figure 3: The performance of fine-tuning 3D-VisTA using various amounts of training data.
  • Figure 4: Qualitative results for various tasks. Italic text marks the inputs, blue boxes or text mark the predictions from 3D-VisTA trained from scratch, red marks the predictions from pre-trained 3D-VisTA, and green marks the ground truth. The results show that pre-training improves the understanding of spatial relations, visual concepts, and situations.
  • Figure 5: The performance gap between training from scratch and pre-training over different sentence lengths ($\leq 15, \leq 30, > 30$) in ScanRefer.
  • ...and 2 more figures