3D Question Answering via only 2D Vision-Language Models

Fengyun Wang, Sicheng Yu, Jiawei Wu, Jinhui Tang, Hanwang Zhang, Qianru Sun

TL;DR

This paper tackles 3D question answering in a zero-shot setting using only 2D vision-language models. It introduces cdViews, a framework that selects critical and diverse 2D views with two components, viewSelector and viewNMS, and so avoids explicit 3D-language alignment. Experiments on ScanQA and SQA show state-of-the-art performance against 3D and hybrid methods while maintaining computational efficiency. The work suggests that pre-trained 2D LVLMs, when paired with effective view selection, are a viable and efficient alternative for 3D understanding tasks.

Abstract

Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential for 3D scene understanding, using 3D question answering (3D-QA) as a representative task. Because 3D training data is limited, we do not train LVLMs but run inference in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. Once the 2D model is fixed, e.g., LLAVA-OV, the quality of the sampled views matters most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks.
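
The interplay of the two components described in the abstract can be sketched as a greedy selection loop: rank views by a relevance score, then skip any view that overlaps too much with one already kept. This is a minimal illustration, not the authors' implementation. The function name `select_views`, the precomputed `overlap` matrix, and the `max_overlap` threshold are all assumptions made for the example, and the abstract does not say how spatial overlap is measured.

```python
from typing import List, Sequence


def select_views(
    scores: Sequence[float],
    overlap: Sequence[Sequence[float]],
    k: int,
    max_overlap: float = 0.5,
) -> List[int]:
    """Greedy, NMS-style selection of up to k view indices.

    scores[i]     : hypothetical relevance score of view i (higher is better),
                    standing in for the output of a view scorer.
    overlap[i][j] : hypothetical spatial-overlap measure between views i and j,
                    in [0, 1]; how it is computed is left open here.
    max_overlap   : assumed threshold above which two views count as redundant.
    """
    # Visit views from most to least relevant.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) == k:
            break
        # Keep the view only if it is not redundant with any view kept so far.
        if all(overlap[i][j] <= max_overlap for j in kept):
            kept.append(i)
    return kept
```

For example, with scores `[0.9, 0.8, 0.3, 0.7]` and an overlap of 0.9 between views 0 and 1 (0.1 for every other pair), `select_views(scores, overlap, k=2)` returns `[0, 3]`: view 1 scores second highest but is dropped as redundant with view 0.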

Paper Structure

This paper contains 15 sections, 13 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Comparison of 3D Question Answering methods. (a): a1 for 3D-based methods; a2 and a3 for hybrid (2D+3D) methods. All of these methods require computationally intensive 3D-language alignment using point cloud data for spatial reasoning. a4 is our method that leverages pre-trained LVLMs operating solely on 2D views. The well-aligned features between 2D visual features and language in 2D LVLMs enable zero-shot 3D-QA. (b): Model comparison on the test set (with objects) of ScanQA. The upper-right corner indicates the best performance. The circle area represents the size of training data required for aligning 3D and language. The "✕" denotes zero-shot 3D-QA using 2D model LLAVA-OV [li2024llava]. We respectively use ① uniform sampling, ② image retrieval, and ③ our cdViews, to select views as input to LLAVA-OV.
  • Figure 2: Comparison of view selection methods.
  • Figure 3: The pipeline of zero-shot 3D-QA using three different view selection methods: uniform sampling (option ①), image retrieval (option ②), and our cdViews (option ③). The views marked with ★ are selected ones. As for inference, our cdViews has two modules to run: the viewSelector identifies critical views, and the viewNMS enhances view diversity and minimizes redundancy. The viewSelector is trained using automatically generated labels from the viewAnnotator module, which is detailed in Figure 5.
  • Figure 4: Performance comparison of view selection methods on the validation set of ScanQA [azuma2022scanqa]. It can be observed that: 1) performance improves with an increasing number of views, peaks at a certain point, and finally declines; and 2) noticeable performance gaps arise from different view selection methods, highlighting the importance of effective view selection. An earlier peak (30.1) appears in cdViews thanks to viewNMS.
  • Figure 5: Our viewAnnotator module operates in two steps: Caption Generation and View Matching (illustrated by light green boxes indicating outputs at each step). In Step 1, an LVLM processes question-answer pairs to produce detailed descriptive captions. In Step 2, these captions are compared against sampled views to assess their relevance in answering the corresponding questions. For clarity, the figure depicts only positive (A) and negative (B) view matches, excluding uncertain (C) ones.
  • ...and 5 more figures
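
Of the three view selection options named in the Figure 3 caption, uniform sampling (option ①) is the simplest to make concrete. The sketch below picks evenly spaced indices from an ordered sequence of views. It is an assumed baseline for illustration only: the function name `uniform_view_indices` and the choice to include both endpoints are not taken from the paper.

```python
from typing import List


def uniform_view_indices(num_views: int, k: int) -> List[int]:
    """Return up to k evenly spaced indices into a sequence of num_views views.

    Assumption: views are ordered (e.g., frames along a scan trajectory) and
    both the first and the last view are included when k >= 2.
    """
    if num_views <= 0 or k <= 0:
        return []
    k = min(k, num_views)  # cannot pick more views than exist
    if k == 1:
        return [num_views // 2]  # a single view: take the middle one
    step = (num_views - 1) / (k - 1)
    return [round(i * step) for i in range(k)]
```

With 10 views and `k=4`, this returns `[0, 3, 6, 9]`. Because the indices ignore the question entirely, a baseline like this is what the caption's retrieval-based and cdViews options are compared against.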