Choices Speak Louder than Questions
Gyeongje Cho, Yeonkyoung So, Jaejin Lee
TL;DR
The paper addresses the reliability of MCQA as a measure of language understanding by showing that model decisions can be driven by answer-choice biases rather than genuine question comprehension. It formalizes choice sensitivity, decomposing per-choice scores into choice-driven and question-driven components, and introduces Normalized Probability Shift by the Question (NPSQ) to isolate the question's impact. Through experiments across cloze, symbols, and hybrid formats and multiple model families, it demonstrates that traditional log-likelihood-based metrics are vulnerable to surface features of answer choices, while NPSQ provides a stable, interpretable assessment of comprehension. The findings highlight the need for robust evaluation methods in MCQA and show that instruction tuning can further mitigate choice sensitivity, supporting broader adoption of NPSQ for more reliable benchmarking.
Abstract
Recent findings raise concerns about whether the evaluation of Multiple-Choice Question Answering (MCQA) accurately reflects the comprehension abilities of large language models. This paper explores the concept of choice sensitivity, which refers to the tendency for model decisions to be more influenced by the answer options than by a genuine understanding of the question. We introduce a new scoring method called Normalized Probability Shift by the Question (NPSQ), designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbols, and hybrid formats, we find that traditional scoring methods (such as those based on log-likelihood or its length-normalized variant) are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when modifications are made to the answer options.
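To make the vulnerability of traditional scoring concrete, the following is a minimal sketch of the two baselines named above: summed log-likelihood and its length-normalized variant. All token log-probabilities are made-up illustrative numbers, and the function and variable names are hypothetical. The last quantity, `question_shift`, is only a simple illustration of comparing a choice's score with and without the question; it is an assumption for exposition and is not the NPSQ formula, which this section does not define.

```python
# Illustrative sketch only: all log-probabilities below are invented numbers,
# not outputs of any real model, and the helper names are hypothetical.

def sum_logprob(token_logprobs):
    """Summed log-likelihood of a choice's tokens."""
    return sum(token_logprobs)

def length_normalized(token_logprobs):
    """Summed log-likelihood divided by the number of tokens."""
    return sum(token_logprobs) / len(token_logprobs)

def argmax_choice(scores):
    """Return the choice with the highest score."""
    return max(scores, key=scores.get)

# Per-token log-probs of each choice, conditioned on the question (assumed).
with_question = {
    "Paris": [-0.9],
    "the city of Lyon": [-1.2, -0.3, -0.2, -0.4],
}
# Per-token log-probs of the same choices with the question removed (assumed).
without_question = {
    "Paris": [-3.0],
    "the city of Lyon": [-1.4, -0.3, -0.2, -0.4],
}

# Baseline 1: summed log-likelihood favors the shorter choice.
pred_sum = argmax_choice({c: sum_logprob(lp) for c, lp in with_question.items()})

# Baseline 2: length normalization flips the prediction to the longer choice.
pred_norm = argmax_choice({c: length_normalized(lp) for c, lp in with_question.items()})

# Illustrative only (NOT NPSQ): how much the question raises each choice's score.
question_shift = {
    c: sum_logprob(with_question[c]) - sum_logprob(without_question[c])
    for c in with_question
}
pred_shift = argmax_choice(question_shift)

print(pred_sum, "|", pred_norm, "|", pred_shift)
```

In this toy setup the two baselines disagree purely because the choices differ in token count, a surface feature of the answer options. The longer choice scores well even without the question, so most of its score is choice-driven; the with-versus-without comparison shows the question mainly supports the other option.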
