Evaluating Zero-Shot GPT-4V Performance on 3D Visual Question Answering Benchmarks
Simranjit Singh, Georgios Pavlakos, Dimitrios Stamoulis
TL;DR
The paper evaluates zero-shot GPT-4V and GPT-4 on 3D-VQA benchmarks to understand open-vocabulary visual reasoning in 3D scenes. It uses a GPT-4V captioner to describe frames and GPT-4 Turbo to answer questions, comparing open-vocabulary and vocabulary-grounded prompts across Blind and Socratic agent types. Key findings: blind GPT agents are competitive with closed-vocabulary baselines, scene-specific vocabulary improves captioning, and RGB-frame inputs substantially boost performance over mesh renders. Frame sampling and batching expose a trade-off between efficiency and accuracy. The work highlights the potential need to reformulate 3D-VQA benchmarks for foundation models and proposes directions for improved prompting, grounding, and multi-frame in-context evaluation.
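The captioner-then-answerer pipeline summarized above can be illustrated with a minimal sketch of the prompt assembly only. It makes no model calls, and everything in it is an assumption made for illustration rather than the paper's implementation: the helper names, the prompt wording, the uniform sampling strategy, and the reading that a Blind agent sees only the question while a Socratic agent also receives frame captions.

```python
from typing import List, Optional, Sequence


def sample_frames(frame_ids: Sequence[str], num_frames: int) -> List[str]:
    """Pick up to num_frames evenly spaced frames (uniform sampling is an assumption)."""
    if num_frames <= 0 or not frame_ids:
        return []
    if num_frames >= len(frame_ids):
        return list(frame_ids)
    step = len(frame_ids) / num_frames
    return [frame_ids[int(i * step)] for i in range(num_frames)]


def batch(items: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split frames into consecutive batches, one hypothetical captioner call per batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


def build_caption_prompt(vocabulary: Optional[Sequence[str]] = None) -> str:
    """Open-vocabulary prompt by default; vocabulary-grounded if a scene vocabulary is given."""
    prompt = "Describe the objects in these frames and their spatial relations."
    if vocabulary:
        prompt += " Where applicable, use these object names: " + ", ".join(vocabulary) + "."
    return prompt


def build_qa_prompt(question: str, captions: Optional[Sequence[str]] = None) -> str:
    """Blind agent: question only. Socratic agent: question plus frame captions."""
    lines: List[str] = []
    if captions:
        lines.append("Scene description:")
        lines.extend(f"- {caption}" for caption in captions)
    lines.append(f"Question: {question}")
    lines.append("Answer briefly.")
    return "\n".join(lines)


if __name__ == "__main__":
    frames = [f"frame_{i:03d}.jpg" for i in range(10)]  # hypothetical file names
    for group in batch(sample_frames(frames, 4), batch_size=3):
        print(group, "->", build_caption_prompt(["chair", "table"]))
    print(build_qa_prompt("What is next to the table?", ["A chair stands beside a table."]))
```

Fewer sampled frames and larger batches mean fewer captioner calls, which is where the efficiency side of the trade-off would come from in a setup like this one.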
Abstract
As interest in "reformulating" the 3D Visual Question Answering (VQA) problem in the context of foundation models grows, it is imperative to assess how these new paradigms influence existing closed-vocabulary datasets. In this case study, we evaluate the zero-shot performance of foundation models (GPT-4 Vision and GPT-4) on two well-established 3D VQA benchmarks, 3D-VQA and ScanQA. We contextualize the performance of GPT-based agents relative to traditional modeling approaches. We find that GPT-based agents, without any fine-tuning, perform on par with closed-vocabulary approaches. Our findings corroborate recent results showing that "blind" models establish a surprisingly strong baseline in closed-vocabulary settings. We demonstrate that agents benefit significantly from scene-specific vocabulary via in-context textual grounding. By presenting a preliminary comparison with previous baselines, we hope to inform the community's ongoing efforts to refine multi-modal 3D benchmarks.
