Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

Narun Raman; Taylor Lundy; Kevin Leyton-Brown

TL;DR

This work asks whether multiple-choice question-answering (MCQA) scores reflect genuine reasoning in LLMs or whether models are instead exploiting the structure of the test. Evaluating 27 models on 15 benchmarks under five question-presentation formats, it quantifies how much the presence of answer options boosts performance independent of reasoning. MCQA remains a good proxy for downstream ability when models perform chain-of-thought reasoning only before seeing the options. Large models that reason after seeing the options, however, exploit the information in them and outperform their own free-text results, with "none of the above" (NOTA) and format variations modulating this effect. The authors offer practical guidelines for separating reasoning from test exploitation when interpreting MCQA results.

Abstract

When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, GSM8K) and 27 different LLMs (including small models such as Qwen-2.5 7B, mid-sized models such as Llama-3.3 70B, and large state-of-the-art models such as OpenAI's o3). For each model-benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether "none of the above" sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance by exploiting the information in the options. We identify and quantify the signals models are using when answering MCQA questions, and offer practical guidelines for analyzing MCQA results so that they better reflect LLMs' genuine reasoning capabilities.
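
To make two ingredients of the evaluation described above concrete, the following is a minimal Python sketch, not the authors' implementation. It illustrates one way a "none of the above" variant could be built by overwriting the correct option in place, and one way the gap between MCQA and free-text accuracy could be measured. The names `MCQuestion`, `maybe_replace_with_nota`, and `mcqa_gap`, the replacement probability `p`, and the choice to substitute in place rather than append a new option are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, replace

NOTA = "None of the above"


@dataclass(frozen=True)
class MCQuestion:
    """Hypothetical container for one multiple-choice item."""
    question: str
    options: tuple[str, ...]
    answer_index: int  # position of the correct option


def maybe_replace_with_nota(q: MCQuestion, rng: random.Random, p: float = 0.5) -> MCQuestion:
    """With probability p, overwrite the correct option with NOTA.

    The answer index is unchanged, so NOTA becomes the right answer.
    The value of p and the in-place substitution are assumptions.
    """
    if rng.random() >= p:
        return q  # leave the question untouched
    options = list(q.options)
    options[q.answer_index] = NOTA
    return replace(q, options=tuple(options))


def accuracy(correct: list[bool]) -> float:
    """Fraction of items graded correct."""
    return sum(correct) / len(correct)


def mcqa_gap(mcqa_correct: list[bool], free_text_correct: list[bool]) -> float:
    """MCQA accuracy minus free-text accuracy on the same questions.

    A large positive value suggests the options themselves are helping.
    """
    return accuracy(mcqa_correct) - accuracy(free_text_correct)


if __name__ == "__main__":
    q = MCQuestion("What is 2 + 3?", ("4", "5", "6", "7"), answer_index=1)
    print(maybe_replace_with_nota(q, random.Random(0), p=1.0).options)
    print(mcqa_gap([True, True, True, False], [True, False, False, False]))
```

Under this sketch, a model that can only recognize the correct answer among the options, rather than derive it, would be expected to lose accuracy on the NOTA variant, which is why such a variant helps separate reasoning from option exploitation.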
