Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness

Haochen Wang, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiangyu Zhang, Zhaoxiang Zhang

TL;DR

This work introduces Ross3D, a generalist 3D scene understanding model that embeds 3D-aware supervision into visual instruction tuning rather than relying on input-level 3D representations. It proposes two 3D-centric pretext tasks—cross-view reconstruction and global-view reconstruction—to learn accurate spatial relationships and holistic scene layouts from multi-view video frames, BEV renders, and depth-informed position cues. By leveraging a diffusion-based denoising objective and a standard text-generation loss, Ross3D achieves state-of-the-art results across 3D QA, dense captioning, and visual grounding benchmarks, and demonstrates notable semi-supervised capabilities using unlabeled 3D data. The results underscore the potential of 3D-aware visual supervision signals for scalable 3D LMMs and point to future directions in designing task-aligned 3D pretext signals.

Abstract

The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (Ross3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird's-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, Ross3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data.
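As a rough illustration of the cross-view setup described above, the sketch below splits a scene's views into visible inputs and masked reconstruction targets. The helper name `split_views`, the `mask_ratio` parameter, and the uniform random sampling are assumptions made for illustration; this summary does not specify how the paper actually selects the masked views.

```python
import random
from typing import List, Tuple


def split_views(num_views: int, mask_ratio: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: pick which view indices are masked (targets)
    and which stay visible (inputs). Not the paper's actual procedure."""
    if num_views < 2:
        raise ValueError("need at least two views: one visible, one masked")
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must lie strictly between 0 and 1")

    rng = random.Random(seed)
    # Keep at least one masked view and at least one visible view.
    num_masked = max(1, min(num_views - 1, round(num_views * mask_ratio)))
    masked = sorted(rng.sample(range(num_views), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_views) if i not in masked_set]
    return visible, masked


if __name__ == "__main__":
    visible, masked = split_views(num_views=8, mask_ratio=0.25)
    print("visible views:", visible)
    print("masked views (reconstruction targets):", masked)
```

In this sketch the model would see only the visible views and be asked to reconstruct the masked ones, so any overlap in scene content between the two groups has to be exploited to succeed.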

Paper Structure

This paper contains 36 sections, 10 equations, 4 figures, 15 tables.

Figures (4)

  • Figure 1: Performance of Ross3D compared with state-of-the-art alternatives. We report EM on SQA3D [ma2023sqa3d], CIDEr on ScanQA [azuma2022scanqa], ROUGE on Scan2Cap [chen2021scan2cap], Acc@0.25 on ScanRefer [chen2020scanrefer], and F1@0.25 on Multi3DRefer [zhang2023multi3drefer]. With 3D-aware visual supervision, Ross3D significantly outperforms other approaches across various benchmarks.
  • Figure 2: Conceptual comparison of our Ross3D with popular paradigms. Unlike previous methods that primarily focus on input-level modifications to craft 3D-aware input representations, we incorporate 3D-aware visual pretext tasks.
  • Figure 3: Illustration of (a) Ross3D and (b) the detailed architecture of the denoiser $\mathcal{J}_{\pi}$. (a) Given raw video frames $\bm{I}$ for a 3D scene, we apply transformations to obtain inputs $\mathcal{T}_i(\bm{I})$ and targets $\mathcal{T}_o(\bm{I})$, respectively, and subsequently encourage LMMs to recover clean latent tokens $\bm{z}_0 = \mathcal{F} \circ \mathcal{T}_o(\bm{I})$ using noisy tokens $\bm{z}_t$ and visual outputs $\bm{x}_{i \leq N}$. (b) The denoiser is based on DiT [peebles2023dit]. Condition $\bm{c}$ is computed by a set of learnable queries $\bm{q}$, visual outputs $\bm{x}_{i \leq N}$, and timesteps $t$.
  • Figure 4: Qualitative results. Examples of 3D question answering and 3D visual grounding are sampled from ScanQA$_{\text{val}}$ [ma2023sqa3d] and ScanRefer$_{\text{val}}$ [chen2020scanrefer], respectively.
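
The training objective implied by the TL;DR and the Figure 3 caption (a diffusion-based denoising term plus a text-generation loss) can be sketched as follows. This is a hedged reconstruction, not the paper's stated formulation: it assumes a standard DDPM-style forward process with a noise schedule $\bar{\alpha}_t$, a noise-prediction parameterization for the denoiser, and an unweighted sum of the two losses. The symbols $\bar{\alpha}_t$, $\bm{\epsilon}$, $\mathcal{L}_{\text{text}}$, and $\mathcal{L}_{\text{denoise}}$ do not appear in this summary and are introduced here only for illustration.

```latex
% Clean target tokens, as defined in the Figure 3 caption
\bm{z}_0 = \mathcal{F} \circ \mathcal{T}_o(\bm{I})

% Assumed forward (noising) process
\bm{z}_t = \sqrt{\bar{\alpha}_t}\,\bm{z}_0 + \sqrt{1 - \bar{\alpha}_t}\,\bm{\epsilon},
\qquad \bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I}_d)

% Assumed denoising loss; the condition c is built from the learnable
% queries q, the visual outputs x_{i <= N}, and the timestep t
\mathcal{L}_{\text{denoise}} =
\mathbb{E}_{t,\,\bm{\epsilon}}
\left[ \left\| \mathcal{J}_{\pi}(\bm{z}_t;\, \bm{c}) - \bm{\epsilon} \right\|_2^2 \right],
\qquad \bm{c} = \bm{c}(\bm{q},\, \bm{x}_{i \leq N},\, t)

% Assumed overall objective (equal weighting is an assumption)
\mathcal{L} = \mathcal{L}_{\text{text}} + \mathcal{L}_{\text{denoise}}
```

Here $\bm{I}_d$ denotes the identity covariance, written with a subscript to avoid clashing with the video frames $\bm{I}$. Under these assumptions, the cross-view and global-view tasks would differ only in the choice of $\mathcal{T}_i$ and $\mathcal{T}_o$: masked views as targets in the former, and a Bird's-Eye-View image as the target in the latter.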