Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think
Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank
TL;DR
This work examines how to robustly evaluate instruction-tuned LLMs on multiple choice questions (MCQs), comparing answers extracted from the generated text against first-token probability scoring, with and without PriDe debiasing. It introduces a classifier, trained on responses from open models, that extracts the chosen option from a text answer, and it measures robustness with a selection bias metric ($R_{StD}$) and answer entropy under multiple perturbations. Across multiple models and tasks (notably MMLU), text answers show lower selection bias and greater stability, and the robustness gap widens as the mismatch between text and first-token answers grows; in many cases, once the mismatch exceeds 50%, text-based evaluation outperforms even PriDe-debiased first-token scoring. The findings support evaluating the text answer directly as a more faithful and reliable measure of LLM behavior in MCQ settings, with practical implications for benchmark design and model assessment, and they challenge the assumption that first-token probabilities are the primary indicator of MCQ performance.
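As a rough illustration of the two robustness metrics named above, the sketch below assumes that $R_{StD}$ is the standard deviation of per-option recall (in percentage points) and that answer entropy is the Shannon entropy of the answers a model gives to one question across its perturbed variants. The summary does not spell out either definition, so both are assumptions; the function names `recall_std` and `answer_entropy`, and the choice of population standard deviation, are hypothetical.

```python
import math
from collections import Counter
from statistics import pstdev


def recall_std(gold, pred, options=("A", "B", "C", "D")):
    """Standard deviation of per-option recall, in percentage points.

    Assumption: R_StD is computed over the recall of each option ID,
    skipping options that never appear as the gold answer.
    """
    recalls = []
    for opt in options:
        idx = [i for i, g in enumerate(gold) if g == opt]
        if not idx:
            continue  # no gold examples for this option
        hits = sum(1 for i in idx if pred[i] == opt)
        recalls.append(100.0 * hits / len(idx))
    return pstdev(recalls)


def answer_entropy(answers):
    """Shannon entropy (in bits) of the answers given to one question
    across its perturbed variants. Zero means a fully stable answer."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Under these assumptions, a model that always picks "A" on a balanced two-option set has recalls of 100 and 0, so `recall_std` is large, while a model whose answer never changes under perturbation has an `answer_entropy` of zero.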
Abstract
Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.
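To make the comparison concrete, here is a minimal sketch of first-token evaluation and of the mismatch rate between first-token answers and text answers. It assumes the first-token log probabilities for each option letter have already been read from the model, and that the text answers have already been extracted (for example by a classifier). The function names and the dictionary input format are hypothetical and are not taken from the paper.

```python
def first_token_choice(logprobs, options=("A", "B", "C", "D")):
    """Pick the option whose letter has the highest first-token log probability.

    `logprobs` maps option letters to log probabilities; letters missing
    from the mapping are treated as having probability zero.
    """
    return max(options, key=lambda o: logprobs.get(o, float("-inf")))


def mismatch_rate(first_token_answers, text_answers):
    """Fraction of questions where the first-token answer differs
    from the answer extracted from the generated text."""
    if len(first_token_answers) != len(text_answers):
        raise ValueError("answer lists must have the same length")
    if not first_token_answers:
        raise ValueError("answer lists must not be empty")
    differing = sum(f != t for f, t in zip(first_token_answers, text_answers))
    return differing / len(first_token_answers)


# Illustrative values only, not results from the paper.
choice = first_token_choice({"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.0})
rate = mismatch_rate(["A", "B", "C", "D"], ["A", "C", "C", "A"])
```

A mismatch rate above 0.5 corresponds to the regime in which the abstract reports text answers being more robust to option order changes than PriDe-debiased first-token probabilities.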
