Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?

Nishant Balepur, Abhilasha Ravichander, Rachel Rudinger

TL;DR

This work questions whether MCQA accuracy for large language models truly reflects reasoning or merely exploits dataset artifacts. It uses choices-only prompting as a probing method and evaluates four LLMs across ARC, MMLU, and HellaSwag, finding that choices-only prompts beat a majority baseline in 11 of 12 cases. By testing memorization, choice dynamics, and abductive question inference, the paper shows that memorization alone does not explain the high accuracy, that priors over individual choices are insufficient on their own, and that abductive question inference accounts for some but not all of the choices-only accuracy. The authors advocate stronger baselines and robust dataset designs so that MCQA benchmarks measure genuine capabilities rather than artifacts, and they call for further efforts to explain LLM decision-making.

Abstract

Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.
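
To make the setup concrete, here is a minimal sketch of how a full prompt differs from a choices-only prompt, and how a majority class baseline can be computed. The prompt template, the letter labels, and the function names (`full_prompt`, `choices_only_prompt`, `majority_baseline_accuracy`) are illustrative assumptions; the abstract does not specify the exact prompt format the authors used.

```python
from collections import Counter

# Assumed labelling scheme; the paper's actual template may differ.
LETTERS = "ABCDEFGHIJ"


def format_choices(choices):
    """Render answer options as lettered lines, e.g. '(A) option text'."""
    return "\n".join(f"({LETTERS[i]}) {c}" for i, c in enumerate(choices))


def full_prompt(question, choices):
    """Hypothetical full MCQA prompt: question plus choices."""
    return f"Question: {question}\nChoices:\n{format_choices(choices)}\nAnswer:"


def choices_only_prompt(choices):
    """Hypothetical choices-only prompt: the question is withheld."""
    return f"Choices:\n{format_choices(choices)}\nAnswer:"


def majority_baseline_accuracy(gold_labels):
    """Accuracy of always predicting the most frequent gold label."""
    counts = Counter(gold_labels)
    return counts.most_common(1)[0][1] / len(gold_labels)


if __name__ == "__main__":
    options = ["Mercury", "Venus", "Mars", "Jupiter"]
    print(full_prompt("Which planet is closest to the Sun?", options))
    print(choices_only_prompt(options))
    print(majority_baseline_accuracy(["A", "B", "A", "C"]))
```

A choices-only accuracy is only informative relative to such a baseline: if gold labels are skewed toward one position, always guessing that position already scores above uniform chance.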

Paper Structure (44 sections, 1 equation, 14 figures, 9 tables)

Figures (14)

  • Figure 2: LLM accuracy with full prompts versus the partial-input choices-only prompts. An asterisk (*) denotes that the choices-only prompt significantly outperforms the majority class baseline (two-sample $t$-test, $p < 5\text{e-}5$).
  • Figure 3: LLM accuracy with full prompts versus our three tested memorization prompts. An asterisk (*) denotes that the choices-only prompt significantly outperforms the majority class baseline (two-sample $t$-test, $p < 5\text{e-}5$).
  • Figure 4: Scoring of our LLMs with group full and choices-only prompts (dark color, no pattern) versus their individual counterparts (light color, striped). An asterisk (*) denotes the prompt significantly outperforms the majority class baseline (two-sample $t$-test, $p < 5\text{e-}5$). We omit Falcon due to very low accuracy (see Appendix \ref{appendix:falcon_ind}).
  • Figure 5: Accuracy of LLMs when performing MCQA with their own inferred question versus a random question. Due to GPU constraints stemming from long questions, we only evaluate on 75% of the MMLU evaluation set.
  • Figure 6: Accuracy of LLaMA-2 models of different sizes (7B, 13B, 70B) with full prompts versus choices-only prompts on ARC, MMLU, and HellaSwag. An asterisk (*) denotes that the choices-only prompt significantly outperforms the majority class baseline (two-sample $t$-test, $p < 5\text{e-}5$). As the LLM scales in parameters, it obtains higher choices-only accuracy.
  • ...and 9 more figures