3D Question Answering via only 2D Vision-Language Models

Fengyun Wang; Sicheng Yu; Jiawei Wu; Jinhui Tang; Hanwang Zhang; Qianru Sun

TL;DR

This paper tackles 3D question answering in a zero-shot setting by using only 2D vision-language models. It introduces cdViews, a framework that jointly learns to pick critical and diverse 2D views via a viewSelector and viewNMS, avoiding explicit 3D-language alignment. Experiments on ScanQA and SQA show state-of-the-art performance against 3D and hybrid methods while maintaining computational efficiency. The work suggests that pre-trained 2D LVLMs, when paired with effective view selection, are a highly viable and efficient alternative for 3D understanding tasks.

Abstract

Large vision-language models (LVLMs) have significantly advanced numerous fields. In this work, we explore how to harness their potential for 3D scene understanding tasks, using 3D question answering (3D-QA) as a representative example. Because 3D training data is limited, we do not train LVLMs but instead perform inference in a zero-shot manner. Specifically, we sample 2D views from a 3D point cloud and feed them into 2D models to answer a given question. Once the 2D model is chosen, e.g., LLaVA-OV, the quality of the sampled views matters the most. We propose cdViews, a novel approach to automatically selecting critical and diverse Views for 3D-QA. cdViews consists of two key components: viewSelector, which prioritizes critical views based on their potential to provide answer-specific information, and viewNMS, which enhances diversity by removing redundant views based on spatial overlap. We evaluate cdViews on the widely used ScanQA and SQA benchmarks, demonstrating that it achieves state-of-the-art performance in 3D-QA while relying solely on 2D models without fine-tuning. These findings support our belief that 2D LVLMs are currently the most effective alternative to resource-intensive 3D LVLMs for addressing 3D tasks.
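
The two-stage idea in the abstract (rank views by how critical they are, then drop views that overlap spatially with ones already kept) can be illustrated with a generic greedy non-maximum suppression loop. The sketch below is not the paper's implementation: the function names, the `max_overlap` threshold, the precomputed `scores`, and the 1D interval footprints used as a stand-in for spatial overlap are all assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_views(
    scores: Sequence[float],
    overlap: Callable[[int, int], float],
    k: int,
    max_overlap: float = 0.5,
) -> List[int]:
    """Greedy NMS-style selection (hypothetical helper, not the paper's code).

    Visits views from highest to lowest score and keeps a view only if its
    overlap with every already-kept view is at most `max_overlap`.
    Stops once `k` views have been kept.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) == k:
            break
        if all(overlap(i, j) <= max_overlap for j in kept):
            kept.append(i)
    return kept


def interval_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection-over-union of two 1D intervals (toy overlap measure)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Toy data (assumed): each view covers a 1D interval of the scene, and each
# score stands in for whatever a learned view scorer would output.
footprints = [(0.0, 4.0), (1.0, 5.0), (6.0, 9.0), (8.0, 12.0)]
scores = [0.9, 0.8, 0.7, 0.3]


def toy_overlap(i: int, j: int) -> float:
    return interval_iou(footprints[i], footprints[j])


selected = select_views(scores, toy_overlap, k=2)
print(selected)  # [0, 2]
```

In this toy run, view 1 scores highly but is skipped because its footprint overlaps view 0 with an IoU of 0.6, so the second slot goes to view 2, which covers a different part of the scene.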
