Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks

Simranjit Singh, Georgios Pavlakos, Dimitrios Stamoulis

TL;DR

The paper evaluates zero-shot GPT-4V/GPT-4 on 3D-VQA benchmarks to understand open-vocabulary visual reasoning in 3D scenes. It employs a GPT-4V captioner to describe frames and GPT-4 Turbo to answer questions, comparing open-vocabulary and vocabulary-grounded prompts across Blind and Socratic agent types. Key findings show that blind GPT agents are competitive with closed-vocabulary baselines and that scene-specific vocabulary improves captioning, while RGB-frame inputs substantially boost performance over mesh renders; frame sampling and batching reveal clear efficiency-accuracy trade-offs. The work highlights the potential need to reformulate 3D-VQA benchmarks for foundation models and proposes directions for improved prompting, grounding, and multi-frame in-context evaluation.
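The two agent types and the two prompt variants summarized above can be illustrated with a minimal sketch of prompt assembly. This is not the authors' code: the function names, the prompt wording, and the reading of "Blind" as question-only and "Socratic" as question plus frame captions are assumptions made for illustration, and no model API is called.

```python
from typing import Optional, Sequence


def build_caption_prompt(vocabulary: Optional[Sequence[str]] = None) -> str:
    """Build the text prompt sent alongside a frame to the captioner.

    Hypothetical wording. With no vocabulary this is the open-vocabulary
    variant; with a vocabulary it is the vocabulary-grounded variant.
    """
    prompt = "Describe the objects visible in this frame and their spatial relations."
    if vocabulary:
        # Scene-specific object names are injected as in-context text.
        prompt += " Prefer these object names where they apply: " + ", ".join(vocabulary) + "."
    return prompt


def build_qa_prompt(question: str, captions: Optional[Sequence[str]] = None) -> str:
    """Build the prompt sent to the question-answering model.

    Assumed reading: a Blind agent passes no captions and sees only the
    question; a Socratic agent passes the per-frame captions as context.
    """
    lines = []
    if captions:
        lines.append("Scene description:")
        lines.extend(f"- Frame {i}: {c}" for i, c in enumerate(captions, start=1))
    lines.append(f"Question: {question}")
    lines.append("Answer with a short phrase.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Blind agent: question only.
    print(build_qa_prompt("What is next to the desk?"))
    # Socratic agent: question plus captions from the captioner.
    print(build_qa_prompt("What is next to the desk?", ["a desk beside a chair"]))
```

Keeping prompt construction separate from the model call makes it straightforward to compare the four combinations (Blind or Socratic, open-vocabulary or vocabulary-grounded) on the same questions.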

Abstract

As interest in "reformulating" the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundation models (GPT-4 Vision and GPT-4) on well-established 3D VQA benchmarks, namely 3D-VQA and ScanQA. We provide an investigation to contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents without any fine-tuning perform on par with the closed-vocabulary approaches. Our findings corroborate recent results that "blind" models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary comparison with previous baselines, we hope to inform the community's ongoing efforts to refine multi-modal 3D benchmarks.

Paper Structure (17 sections, 1 equation, 1 figure, 6 tables)


Figures (1)

  • Figure 1: Illustration of a zero-shot GPT-4V agent answering 3D Visual Question Answering (VQA) questions. We present a preliminary investigation of how recently introduced open-vocabulary LLMs perform against older well-established closed-vocabulary benchmarks, namely 3D-VQA (Etesam et al., 2022) and ScanQA (Azuma et al., 2022). To stimulate future research and alleviate the extensive computational cost, we will be releasing all prompts, scene captions, GPT responses and results across all ScanNet scenes in our project repo (code available upon acceptance).