VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

Yanling Wang; Yihan Zhao; Xiaodong Chen; Shasha Guo; Lixin Liu; Haoyang Li; Yong Xiao; Jing Zhang; Qi Li; Ke Xu

TL;DR

VisualSimpleQA is a multimodal fact-seeking benchmark that supports decoupled evaluation of the visual and linguistic modules of LVLMs. Each sample pairs a multimodal question with a text-only counterpart and comes with a region-of-interest (ROI) rationale, and a formal difficulty score is used to extract a hard subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show substantial gaps even for top models: GPT-4o reaches just over 60% correctness on multimodal QA and just over 30% on VisualSimpleQA-hard. Relative degradation analysis indicates room for improvement in both modalities. The paper also describes the annotation and verification pipeline used to keep the data high-quality and diverse; the benchmark contains 500 samples, including 129 hard cases, and 200 newly collected images to mitigate dataset bias. The dataset is publicly available.

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at https://huggingface.co/datasets/WYLing/VisualSimpleQA.
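The decoupled evaluation described above can be illustrated with a small sketch. It assumes that each sample yields two correctness judgments, one for the text-only question and one for the multimodal question, and that relative degradation is the share of text-only correctness lost when the model must also interpret the image. That formula, the function names, and the toy judgments are assumptions made for illustration; none of them is taken from the text above.

```python
def correctness(judgments):
    """Fraction of answers judged correct; `judgments` is a list of booleans."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


def relative_degradation(text_only_acc, multimodal_acc):
    """Share of text-only correctness lost on the multimodal question.

    Assumed definition: (text_only - multimodal) / text_only.
    """
    if text_only_acc <= 0:
        raise ValueError("text-only correctness must be positive")
    return (text_only_acc - multimodal_acc) / text_only_acc


# Hypothetical judgments for five paired samples (not real benchmark data).
text_only = [True, True, True, True, False]
multimodal = [True, True, False, False, False]

rd = relative_degradation(correctness(text_only), correctness(multimodal))
print(f"relative degradation: {rd:.2f}")
```

Under this assumed definition, a larger value points to the visual side as the bottleneck, while low text-only correctness points to gaps in the model's factual knowledge itself.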
