Choices Speak Louder than Questions

Gyeongje Cho, Yeonkyoung So, Jaejin Lee

TL;DR

The paper addresses the reliability of MCQA as a measure of language understanding by showing that model decisions can be driven by answer-choice biases rather than genuine question comprehension. It formalizes choice sensitivity by decomposing per-choice scores into choice-driven and question-driven components, and introduces Normalized Probability Shift by the Question (NPSQ) to isolate the question's impact. Through experiments across cloze, symbols, and hybrid formats and multiple model families, it demonstrates that traditional log-likelihood-based metrics are vulnerable to surface features of answer choices, while NPSQ provides a stable, interpretable assessment of comprehension. The findings highlight the need for robust evaluation methods in MCQA and show that instruction tuning can further mitigate choice sensitivity, supporting broader adoption of NPSQ for more reliable benchmarking.

Abstract

Recent findings raise concerns about whether the evaluation of Multiple-Choice Question Answering (MCQA) accurately reflects the comprehension abilities of large language models. This paper explores the concept of choice sensitivity, which refers to the tendency for model decisions to be more influenced by the answer options than by a genuine understanding of the question. We introduce a new scoring method called Normalized Probability Shift by the Question (NPSQ), designed to isolate the impact of the question itself and provide a more reliable assessment of comprehension. Through experiments involving various input formats, including cloze, symbols, and hybrid formats, we find that traditional scoring methods - such as those based on log-likelihood or its length-normalized variant - are vulnerable to superficial characteristics of the answer choices. In contrast, NPSQ remains stable even when modifications are made to the answer options.
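
To make the contrast between the two traditional scoring methods concrete, the following is a minimal sketch of how raw log-likelihood and length-normalized log-likelihood can select different answers for the same question. The log-likelihood values, the example choices, and the function names are hypothetical, and normalizing by character count is an assumption made for illustration (the paper may normalize by tokens or bytes). NPSQ itself is not sketched here because its formula is not given in this section.

```python
from typing import Sequence


def pick_by_loglik(logliks: Sequence[float]) -> int:
    """Return the index of the choice with the highest raw log-likelihood."""
    return max(range(len(logliks)), key=lambda i: logliks[i])


def pick_by_norm_loglik(logliks: Sequence[float], choices: Sequence[str]) -> int:
    """Return the index of the choice with the highest length-normalized
    log-likelihood. Dividing by character count is an assumption."""
    return max(range(len(logliks)), key=lambda i: logliks[i] / len(choices[i]))


# Hypothetical per-choice scores log P(choice | question); not from the paper.
choices = ["Paris", "the city of Lyon"]
logliks = [-6.0, -9.0]

raw_pick = pick_by_loglik(logliks)                 # favors the shorter choice
norm_pick = pick_by_norm_loglik(logliks, choices)  # favors the longer choice

print(choices[raw_pick], "|", choices[norm_pick])
```

In this toy case the prediction flips purely because of choice length, which is the kind of surface feature the abstract says these metrics are vulnerable to.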

Paper Structure

This paper contains 19 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Choice sensitivity across model sizes and few-shot examples.
  • Figure 2: The impact of instruction tuning and task instructions on choice sensitivity.
  • Figure 3: Impact of adversarial choices on Llama3.1-8B-Instruct.
  • Figure 4: Accuracy across various formats and scoring methods for the models. Results are outlined for the cloze, symbols, and hybrid formats using three metrics: acc (log-likelihood), acc_norm (length-normalized log-likelihood), and acc_npsq (NPSQ).
  • Figure 5: Three MCQA input formats considered in this study, following the categorization of Alzahrani et al. (2024): cloze, symbols, and hybrid formats. Here, $Q$, $C$, and $x$ correspond to the question-related input, the choice-related input, and a specific answer candidate, respectively. Instruction refers to task-defining text (e.g., "Answer the given question"), while prefix refers to fixed labels in the prompt (e.g., "Question:", "Answer:") used to structure the input.
  • ...and 3 more figures