Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think

Xinpeng Wang; Chengzhi Hu; Bolei Ma; Paul Röttger; Barbara Plank

TL;DR

This work examines how to robustly evaluate instruction-tuned LLMs on multiple choice questions (MCQs), comparing text-based answer extraction against probability-based first-token scoring and PriDe debiasing. It introduces a text-answer classifier trained on open-model responses to extract MCQ choices and evaluates robustness using a selection bias metric ($R_{StD}$) and answer entropy under multiple perturbations. Across multiple models and tasks (notably MMLU), text-based answers show lower bias and greater stability, with robustness gains growing as the mismatch between text and first-token answers increases; when the mismatch exceeds 50%, text answers are more robust to option order changes than PriDe-debiased first-token probabilities. The findings advocate for direct text-answer evaluation as a more faithful and reliable measure of LLM behavior in MCQ settings, with practical implications for benchmark design and model assessment. Overall, the paper demonstrates that instruction-tuned language models exhibit substantial robustness in their text answers, challenging the assumption that first-token probabilities are the primary indicator of MCQ performance.
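The two robustness metrics named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `recall_std` and `answer_entropy` are hypothetical, the use of the population standard deviation and of base-2 logarithms are assumptions, and the four-option label set is chosen only for illustration.

```python
import math
from collections import Counter
from statistics import pstdev


def recall_std(gold, pred, options=("A", "B", "C", "D")):
    """Standard deviation of per-option recalls, in percentage points.

    A value near 0 means every option ID is recalled about equally well,
    i.e. little selection bias. Population std is an assumption here.
    """
    recalls = []
    for opt in options:
        idx = [i for i, g in enumerate(gold) if g == opt]
        if not idx:
            continue  # skip options that never appear as the gold label
        hits = sum(1 for i in idx if pred[i] == opt)
        recalls.append(100.0 * hits / len(idx))
    return pstdev(recalls)


def answer_entropy(answers):
    """Shannon entropy (bits) of one question's answers across perturbations."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A model that always answers "A" has maximally uneven recalls.
print(recall_std(["A", "A", "B", "B"], ["A", "A", "A", "A"]))
# A model whose answer flips between two options under perturbation.
print(answer_entropy(["A", "B"]))
```

Lower values are better for both quantities: a small recall spread indicates little preference for particular option IDs, and low entropy indicates that the answer stays the same when the question is perturbed.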

Abstract

Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.
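As a rough illustration of the two evaluation styles contrasted in the abstract, the sketch below picks a first-token answer by ranking option log probabilities and then measures how often it disagrees with an already-extracted text answer. The function names, the dictionary input format, and the "Refusal" label are assumptions made for this example; obtaining the log probabilities from a model and extracting the text answer (done with a trained classifier in the paper) are outside its scope.

```python
def first_token_answer(option_logprobs):
    """Return the option ID whose first token has the highest log probability.

    `option_logprobs` maps option IDs (e.g. "A") to log probabilities;
    this input format is an assumption for the sketch.
    """
    return max(option_logprobs, key=option_logprobs.get)


def mismatch_rate(first_token_answers, text_answers):
    """Fraction of questions where the first-token and text answers differ."""
    if len(first_token_answers) != len(text_answers):
        raise ValueError("answer lists must have the same length")
    if not first_token_answers:
        raise ValueError("answer lists must not be empty")
    mismatches = sum(f != t for f, t in zip(first_token_answers, text_answers))
    return mismatches / len(first_token_answers)


# Hypothetical log probabilities for one question.
print(first_token_answer({"A": -0.2, "B": -1.9, "C": -3.0, "D": -2.5}))

# Text answers may fall outside the option IDs, e.g. when the model refuses.
print(mismatch_rate(["A", "B", "C", "D"], ["A", "B", "D", "Refusal"]))
```

Because text answers are not restricted to the given option IDs, any label outside the option set counts as a mismatch in this sketch.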

Paper Structure (30 sections, 2 equations, 14 figures, 9 tables)

Introduction
Mismatch between the probability and the text-based MCQ evaluation
Robustness evaluation of the probability and text-based approaches
Experimental setup
Models
Benchmark
Prompting
Probability-based evaluation
Text-based evaluation
Annotation scheme and classifier training
Metrics
Standard deviation of recalls
Entropy
Selection bias result
Perturbations
...and 15 more sections

Figures (14)

Figure 1: An example mismatch case between the first token probabilities and the text answer given by the Llama2-7b-Chat model.
Figure 2: Accuracy and selection bias results. A lower $R_{StD}$ score means a smaller selection bias. As the mismatch rate decreases from Gemma ($56.8\%$) to Mistral ($10.2\%$), the performance gap between the first token (red) and text answer (blue) decreases. Text answers from Gemma and Llama2 have lower selection bias than the debiased first token answers.
Figure 3: Answer Floating Rate. Text answers are more robust to adding options, except Mistral.
Figure 4: Answer distribution before and after adding additional options. Text answers show less distribution shift after adding additional options. Note that the text answers are not limited to the given options in the original options setting.
Figure 5: $R_{StD}$ of Llama2-7b-Chat in selected subcategories.
...and 9 more figures
