VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Yanling Wang, Yihan Zhao, Xiaodong Chen, Shasha Guo, Lixin Liu, Haoyang Li, Yong Xiao, Jing Zhang, Qi Li, Ke Xu

TL;DR

VisualSimpleQA targets factuality in LVLMs. It enables decoupled evaluation of visual and linguistic modules through paired multimodal and text-only questions plus region-of-interest (ROI) rationales, and it defines a formal difficulty score that yields a hard subset, VisualSimpleQA-hard. Experiments on 15 LVLMs reveal substantial gaps even for top models (e.g., GPT-4o achieves only 60%+ correctness on multimodal QA and 30%+ on VisualSimpleQA-hard), and relative degradation analysis shows room for improvement in both modalities. The paper also details an annotation and verification pipeline intended to ensure high-quality, diverse data; the benchmark comprises 500 samples, including 129 hard cases and 200 newly collected images to mitigate dataset bias. Overall, VisualSimpleQA provides a practical framework for diagnosing factuality failures in multimodal QA systems, and the dataset is publicly available.

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
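
The decoupled evaluation can be illustrated with a small sketch. It assumes that relative degradation is the relative drop in correctness from the text-only question to its paired multimodal question; this excerpt does not state the exact formula, so that definition, the function name, and the numbers below are all illustrative assumptions rather than the paper's method or results.

```python
def relative_degradation(text_only_correct: float, multimodal_correct: float) -> float:
    """Relative drop in correctness from text-only QA to multimodal QA.

    Assumed definition (not taken from the excerpt):
        (text_only - multimodal) / text_only
    A larger value suggests the visual side loses more of what the
    linguistic side could otherwise answer.
    """
    if not 0.0 < text_only_correct <= 1.0:
        raise ValueError("text-only correctness must be in (0, 1]")
    if not 0.0 <= multimodal_correct <= 1.0:
        raise ValueError("multimodal correctness must be in [0, 1]")
    return (text_only_correct - multimodal_correct) / text_only_correct


# Hypothetical correctness rates for one model, not results from the paper.
text_only = 0.80    # correctness on the text-only questions
multimodal = 0.60   # correctness on the paired multimodal questions
rd = relative_degradation(text_only, multimodal)
print(f"relative degradation: {rd:.2%}")
```

Under this reading, text-only correctness is an indicator of the linguistic module, while the gap to multimodal correctness points to losses introduced by visual recognition.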

Paper Structure

This paper contains 23 sections, 2 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Illustration of an example in VisualSimpleQA. The red box highlights the region of interest (ROI). Each sample has several attributes and tags, which allow us to measure its overall difficulty score based on our proposed difficulty criteria.
  • Figure 2: Decoupled evaluation process.
  • Figure 3: Flowchart of the annotation process. Evidence is used to guarantee the correctness of the answer, while ROI is annotated to calculate the difficulty of each sample.
  • Figure 4: Distribution of topics in VisualSimpleQA.
  • Figure 5: Distributions of factors that influence the difficulty of visual recognition. TI denotes Text in Image.
  • ...and 8 more figures