Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation

Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy

TL;DR

The authors address inconsistencies in evaluating vision language models (VLMs) on open-ended VQA by proposing AutoConverter, a multi-agent, GPT-4o-based system that automatically converts open-ended questions into correct and challenging multiple-choice (MC) questions. They build VMCBench by transforming 20 VQA datasets into a unified MC format and validating a large corpus of questions with human checks. Evaluated on 33 VLMs, VMCBench supports scalable, reproducible evaluation and shows that modern public models are approaching the performance of private systems, with clear benefits from model scaling and cross-domain coverage. The work provides open-source tooling, notes limitations (e.g., remaining errors tied to ground-truth data), and identifies broader dataset and model coverage as future work toward standardized VLM benchmarking.

Abstract

The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation. However, current visual question answering (VQA) benchmarks often depend on open-ended questions, making accurate evaluation difficult due to the variability in natural language responses. To address this, we introduce AutoConverter, an agentic framework that automatically converts these open-ended questions into multiple-choice format, enabling objective evaluation while reducing the costly multiple-choice question creation process. Our experiments demonstrate that AutoConverter can generate correct and challenging multiple-choice questions, with VLMs demonstrating consistently similar or lower accuracy on these questions compared to human-created ones. Using AutoConverter, we construct VMCBench, a benchmark created by transforming 20 existing VQA datasets into a unified multiple-choice format, totaling 9,018 questions. We comprehensively evaluate 33 state-of-the-art VLMs on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.

Paper Structure (28 sections, 19 figures, 12 tables)

Figures (19)

  • Figure 1: Overview. (Left) We analyze existing open-ended VQA evaluation metrics, underscoring their limitations in providing accurate and reproducible assessments. (Middle) We introduce AutoConverter, a multi-agent system that automatically converts open-ended questions into multiple-choice format, enabling objective assessment while reducing the costly question creation process. (Right) Using AutoConverter, we convert and refine 20 existing VQA datasets into a unified multiple-choice benchmark to support future VLM research.
  • Figure 2: Challenges in evaluating open-ended questions. (Left) Rule-based metrics significantly underestimate model performance and penalize models that do not strictly follow the expected format. (Right) Model-based evaluations using two different versions of GPT yield substantially different scores, making comparisons inconsistent and raising reproducibility issues. The repeated points represent different model sizes within the same model family (e.g., GPT-4o/GPT-4o-Mini for 2 points).
  • Figure 3: AutoConverter framework and results. (Left) AutoConverter is a multi-agent framework with two key steps: increasing difficulty and ensuring the correctness of the converted question. (Right) We perform an ablation study on AutoConverter and find that each component is crucial for enhancing question correctness and achieving the desired level of difficulty.
  • Figure 4: AutoConverter generates challenging multiple-choice questions. Using AutoConverter, we generated distractors for questions and answers from three existing multiple-choice datasets: MMMU, MathVista, and AI2D, and compared them with original human-created distractors. We evaluated various VLMs on both the AutoConverter-generated and the original questions, finding that VLMs consistently achieved similar or even lower accuracy on the AutoConverter-generated questions compared to the original ones.
  • Figure 5: Qualitative comparison of the original questions, naive baseline-generated questions, and AutoConverter-generated questions. AutoConverter simulates errors from different perspectives and produces correct and challenging multiple-choice questions.
  • ...and 14 more figures