Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

Narun Raman, Taylor Lundy, Kevin Leyton-Brown

TL;DR

This work asks whether multiple-choice question-answering (MCQA) scores truly reflect reasoning in LLMs or instead reflect exploitation of test structure. By systematically evaluating 15 benchmarks and 27 models across five input/response formats, it quantifies the extent to which presenting options boosts performance independent of genuine reasoning. The study finds that MCQA remains a good proxy for downstream ability when chain-of-thought reasoning occurs only before the model sees the options, but large models that reason after the options are presented can use them to inflate scores, with "none of the above" (NOTA) and format variations modulating this effect. The authors offer practical guidelines for separating reasoning from test exploitation when analyzing MCQA results, so that benchmarks assess LLM capabilities more faithfully.

Abstract

When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of 15 different question-answering benchmarks (e.g., MMLU, GSM8K) and 27 different LLMs (including small models such as Qwen-2.5 7B, mid-sized models such as Llama-3.3 70B, and large state-of-the-art models such as OpenAI's o3). For each model-benchmark pair, we considered 5 ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether "none of the above" sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance by exploiting the information in the options. We identify and quantify the signals models are using when answering MCQA questions, and offer practical guidelines for analyzing MCQA results in ways that better reflect LLMs' genuine reasoning capabilities.

Paper Structure

This paper contains 31 sections, 1 equation, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Pass@1 accuracy of each LLM on the set of CoT-extractable questions in the benchmark suite over QMC-CoT (dark) and Q-CoT (light). LLMs are grouped into reasoning models (red) and non-reasoning models (blue), sorted by parameter count. Beneath every Q-CoT bar, we plot the boost in accuracy Q-CoT would have gotten with random guessing denoted Q-CoT+$k$.
  • Figure 2: The amount of exploitation by each LLM on the set of CoT-extractable questions in the benchmark suite. Reasoning models are in red and non-reasoning models in blue.
  • Figure 3: The MC normalized accuracy of non-reasoning models (Qwen3 models) on QMC-CoT in dark blue (dark red) and non-reasoning models (Qwen3 thinking mode off in the second step) super-scored on Q-CoT and Q-CoT-MC-1T in light blue (light red). LLMs are sorted by $E_{\mathrm{QMC}}$.
  • Figure 4: The normalized MC-only exploitation of all models on MMLU and MMLU-Pro. Reasoning models are hatched.
  • Figure 6: This figure plots the percentage of questions (by subject) that passed the filters we ran on the MMLU portion of the Open-LLM benchmark. We note that there was not a systematic removal of "reasoning" subjects over answer retrieval subjects.
  • ...and 5 more figures
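
The Figure 1 caption compares QMC-CoT accuracy with Q-CoT accuracy plus the boost that random guessing would have provided (Q-CoT+$k$). One way such a baseline could be computed is sketched below in Python. It assumes that $k$ is the number of answer options and that the model guesses uniformly among them on every question it fails in free text. The function names and the formula are illustrative assumptions, not the authors' definitions.

```python
def guess_adjusted_accuracy(free_text_acc: float, k: int) -> float:
    """Free-text accuracy plus the expected gain from uniform guessing.

    Assumption: on each question the model fails in free text, it would
    pick the correct option with probability 1/k.
    """
    if not 0.0 <= free_text_acc <= 1.0:
        raise ValueError("free_text_acc must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be a positive number of options")
    return free_text_acc + (1.0 - free_text_acc) / k


def excess_over_guessing(mc_acc: float, free_text_acc: float, k: int) -> float:
    """Multiple-choice accuracy not explained by free-text skill plus guessing."""
    return mc_acc - guess_adjusted_accuracy(free_text_acc, k)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only; not results from the paper.
    baseline = guess_adjusted_accuracy(free_text_acc=0.5, k=4)
    gap = excess_over_guessing(mc_acc=0.8, free_text_acc=0.5, k=4)
    print(f"guess-adjusted baseline: {baseline:.3f}, excess: {gap:.3f}")
```

Under these assumptions, a positive excess means the multiple-choice score is higher than free-text ability plus chance alone would predict, which is the kind of gap the abstract attributes to exploiting information in the options.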