Unexplored flaws in multiple-choice VQA evaluations

Fabio Rosenthal, Sebastian Schmidt, Thorsten Graf, Thorsten Bagodonat, Stephan Günnemann, Leo Schwinn

TL;DR

The paper identifies unexplored biases in prompt formatting for multiple-choice VQA with Multimodal LLMs (MLLMs), demonstrating that small, semantically neutral changes to prompt structure can drastically alter benchmark results. It conducts a large-scale study across seven MLLMs and five VQA datasets, exploring 48 prompt-format permutations to quantify effects beyond known option-order biases. Using linear mixed models, it shows significant impacts from option ID sets and delimiters, and finds that high model confidence does not shield against these biases. It further reveals that current bias-mitigation strategies (PIA, PriDe, CP-LN) fail to address these prompt-format-induced biases, and it recommends open-ended evaluation and prompt-format diversification to improve the reliability of VQA benchmarks.

Abstract

Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM's confidence in the correct answer. Finally, we demonstrate that existing bias mitigation strategies fail to address these newly identified biases.

Paper Structure

This paper contains 32 sections, 12 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Prompt template in multiple-choice VQA. A typical prompt in VQA evaluation consists of the question itself, the instruction, and the answer options, each indexed by an option ID. Each option is separated from its option ID by the option delimiter, while the individual option ID-option pairs are divided by the option separator.
  • Figure 2: Biases in multiple-choice VQA evaluation of MLLMs. We compare benchmarking results on A-OKVQA using two distinct prompt formats applied to the same MLLMs. Prompt Format 1 uses lowercase option IDs, double brackets as delimiters, and line breaks as separators, whereas Prompt Format 2 uses uppercase option IDs, dots as delimiters, and commas as separators. Changing these formatting choices significantly alters the ranking of models, highlighting sensitivity to prompt design.
  • Figure 3: Benchmarking results across five multiple-choice VQA datasets. The figure shows the distribution of ranks for seven MLLMs across $48$ prompt format permutations. The ranks vary substantially across datasets and models, indicating high sensitivity to prompt format. Compared to an ideal evaluation (left), where ranks remain constant across formats, real-world evaluations exhibit significant variability, highlighting that prompt robustness remains a major challenge for current MLLMs.
  • Figure 4: Accuracy deviations of MLLMs on multiple-choice VQA datasets. We report the average deviation from the mean accuracy across all prompt formats per model-dataset pair. The results are averaged over 12 evaluations for option IDs and option delimiter, and 16 for option separator. Option position bias is mitigated via a circular evaluation scheme.
  • Figure 5: Coverage of MLLMs on multiple-choice VQA datasets. The results are averaged over 12 evaluations for option IDs and option delimiter, and 16 for option separator. While models such as Gemma-3, Phi-4, Qwen-2-VL, and Qwen-2.5-VL consistently achieve coverage above $75\%$, indicating stable instruction-following capabilities, others, like LLaVA-1.5, LLaVA-OV, and Phi-3.5, exhibit significant drops under certain prompt formats, revealing strong sensitivity to formatting variations. Grey cells indicate $0\%$ coverage, meaning the models fail to produce any valid answers in these cases.
  • ...and 10 more figures
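
The prompt template of Figure 1 and the format grid behind Figures 3 to 5 can be illustrated with a minimal Python sketch. It is not the authors' code. The captions imply four option ID sets, four delimiters, and three separators (4 × 4 × 3 = 48, giving 12 evaluations per ID set or delimiter and 16 per separator), but they name only lowercase and uppercase IDs, double brackets and dots, and line breaks and commas. Every other level below, the reading of "double brackets" as `(a)`, the placement of the instruction after the options, and all function names are assumptions made for illustration. The `circular_shifts` helper shows one common reading of a circular evaluation scheme (rotating the option order), which the source does not spell out.

```python
from itertools import product

# Factor levels. Entries marked "assumed" are placeholders, not taken from the paper.
OPTION_ID_SETS = {
    "upper": ["A", "B", "C", "D"],
    "lower": ["a", "b", "c", "d"],
    "digits": ["1", "2", "3", "4"],        # assumed
    "roman": ["I", "II", "III", "IV"],     # assumed
}
# Each delimiter is a (prefix, suffix) pair placed around the option ID.
DELIMITERS = {
    "dot": ("", ". "),
    "double_bracket": ("(", ") "),         # assumed rendering of "double brackets"
    "single_bracket": ("", ") "),          # assumed
    "colon": ("", ": "),                   # assumed
}
SEPARATORS = {
    "newline": "\n",
    "comma": ", ",
    "semicolon": "; ",                     # assumed
}


def build_prompt(question, options, ids, delimiter, separator,
                 instruction="Answer with the option ID only."):
    """Render question, ID-option pairs, and instruction as one prompt string."""
    prefix, suffix = delimiter
    pairs = [f"{prefix}{i}{suffix}{o}" for i, o in zip(ids, options)]
    # Instruction placement after the options is an assumption.
    return f"{question}\n{separator.join(pairs)}\n{instruction}"


def all_formats():
    """Enumerate every (ID set, delimiter, separator) name combination."""
    return list(product(OPTION_ID_SETS, DELIMITERS, SEPARATORS))


def circular_shifts(options):
    """Return every rotation of the option list, one per starting position."""
    return [options[k:] + options[:k] for k in range(len(options))]


if __name__ == "__main__":
    print(len(all_formats()), "prompt formats")
    ids_name, delim_name, sep_name = all_formats()[0]
    print(build_prompt(
        "What is the person holding?",
        ["umbrella", "bat", "phone", "cup"],
        OPTION_ID_SETS[ids_name],
        DELIMITERS[delim_name],
        SEPARATORS[sep_name],
    ))
```

Under these assumptions, evaluating one model on one dataset would mean looping over `all_formats()` and, within each format, over `circular_shifts(options)`, so that differences between formats are not confounded with option position.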