An Evaluation of GPT-4V and Gemini in Online VQA

Mengchen Liu, Chongyan Chen, Danna Gurari

TL;DR

This work conducts a comprehensive zero-shot evaluation of GPT-4V and Gemini on the VQAonline dataset, an ecologically valid visual QA collection sourced from Stack Exchange. By extracting seven metadata dimensions (topic, super-topic, user intention, image processing capabilities, image type, difficulty, and knowledge type) for ~1,903 questions, the authors dissect model performance across diverse information needs. They find GPT-4V generally outperforms Gemini (average accuracies around 0.53 vs. 0.42) and reveal topic-, capability-, and knowledge-type–dependent strengths and weaknesses, such as GPT-4V excelling in science-focused topics but struggling with identification and puzzle-related questions, while Gemini shows advantages in scene understanding and certain knowledge domains. The study underscores the importance of richer metadata in diagnosing LMM capabilities and guides future work toward multi-metadata-aware evaluations and broader dataset coverage to advance multimodal reasoning in real-world, online contexts.

Abstract

While there is much excitement about the potential of large multimodal models (LMMs), a comprehensive evaluation is critical to establish their true capabilities and limitations. In support of this aim, we evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering dataset sourced from an authentic online question answering community. We conduct a fine-grained analysis by generating seven types of metadata for nearly 2,000 visual questions, such as image type and the required image processing capabilities. Our zero-shot performance analysis highlights the types of questions that are most challenging for both models, including questions on the "puzzling" topic, with the "Identification" user intention, with the "Sheet Music" image type, or labeled as "hard" by GPT-4.
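The fine-grained analysis described above amounts to grouping scored answers by one metadata dimension and comparing the two models within each group. The following is a minimal sketch of that idea, not the authors' code. It assumes each answer has already been scored as a number in [0, 1] (the scoring procedure is not specified in this excerpt). The records, the function name `accuracy_by_group`, and the "Diagram" and "Photo" labels are hypothetical, and the toy numbers are not results from the paper.

```python
from collections import defaultdict

# Hypothetical toy records: (image type, score for model A, score for model B).
# Scores are assumed to lie in [0, 1]; these values are illustrative only.
records = [
    ("Sheet Music", 0, 0),
    ("Sheet Music", 1, 0),
    ("Diagram", 1, 1),
    ("Diagram", 1, 0),
    ("Photo", 0, 1),
    ("Photo", 1, 1),
]


def accuracy_by_group(rows):
    """Return {group: (acc_a, acc_b, acc_a - acc_b)} for two models."""
    totals = defaultdict(lambda: [0, 0, 0])  # [count, sum of A, sum of B]
    for group, score_a, score_b in rows:
        entry = totals[group]
        entry[0] += 1
        entry[1] += score_a
        entry[2] += score_b
    return {
        group: (sum_a / n, sum_b / n, (sum_a - sum_b) / n)
        for group, (n, sum_a, sum_b) in totals.items()
    }


breakdown = accuracy_by_group(records)
for group, (acc_a, acc_b, gap) in sorted(breakdown.items()):
    print(f"{group:12s} model_a={acc_a:.2f} model_b={acc_b:.2f} gap={gap:+.2f}")
```

Running the same function once per metadata dimension (topic, user intention, image type, and so on) would give one breakdown per dimension; a positive gap marks groups where the first model scores higher.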

Paper Structure (29 sections, 21 figures, 7 tables)

Figures (21)

  • Figure 1: GPT-4V Accuracy.
  • Figure 2: Gemini Accuracy.
  • Figure 3: Accuracy: GPT-4V - Gemini.
  • Figure 4: Accuracy comparison between GPT-4V and Gemini across super-topics.
  • Figure 5: Correlation of topic-level accuracy between GPT-4V and Gemini. Each point is a topic whose x- and y-coordinates are the accuracies of GPT-4V and Gemini, respectively.
  • ...and 16 more figures