None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering

Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen

TL;DR

This study investigates how None-of-the-Above (NA) options affect large language model (LLM) evaluation on multiple-choice benchmarks. By benchmarking 28 LLMs on the MMLU dataset with NA-as-Answer and NA-as-Distractor variants, the authors show a robust 30-50% accuracy drop when NA is the correct answer, with pronounced domain differences (math being relatively robust, business ethics highly affected). They provide item-quality analyses using Discrimination Index and KR-20 reliability, finding increased discrimination without sacrificing test reliability, and demonstrate that NA-robustness varies by domain and question type. The work further explores confidence and phrasing sensitivity, revealing domain-specific confidence declines and some robustness to NA wording. Importantly, targeted fine-tuning with LoRA-based methods (SFT and DPO) on NA tasks substantially improves NA handling (up to ~57.7% in NA-keyed settings) and generalizes to other benchmarks like GPQA, underscoring practical paths to mitigate NA-related weaknesses. Overall, the results imply that MCQA benchmarks designed for humans may misrepresent LLM capabilities and highlight the need for uncertainty-aware evaluation and training strategies for more reliable real-world deployment.
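The two item-quality statistics named above can be illustrated with a minimal sketch. It uses the textbook definitions: KR-20 as k/(k-1) * (1 - sum(p_i * q_i) / variance of total scores), and the Discrimination Index as the per-item gap in correct rate between the highest- and lowest-scoring groups. The function names, the use of population variance, and the 27% group fraction are assumptions made for illustration; the summary does not say which conventions the authors use. Each row is one test taker (in this setting, presumably one LLM) and each column one question.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 reliability for a binary (0/1) response matrix.

    responses: list of rows, one row per test taker, one column per item.
    Assumes population variance of total scores (a convention choice).
    """
    n_people = len(responses)
    n_items = len(responses[0])
    # p_i: proportion of test takers answering item i correctly
    p = [sum(row[i] for row in responses) / n_people for i in range(n_items)]
    pq_sum = sum(p_i * (1 - p_i) for p_i in p)
    total_var = pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


def discrimination_index(responses, frac=0.27):
    """Per-item difference in correct rate between top and bottom groups.

    frac is the share of test takers in each group (0.27 is a common
    textbook choice and an assumption here).
    """
    ranked = sorted(responses, key=sum)  # ascending by total score
    n_group = max(1, round(frac * len(ranked)))
    lower, upper = ranked[:n_group], ranked[-n_group:]
    n_items = len(responses[0])
    return [
        sum(r[i] for r in upper) / n_group - sum(r[i] for r in lower) / n_group
        for i in range(n_items)
    ]


# Toy data: 4 hypothetical test takers, 3 items (not from the paper).
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(toy))
print(discrimination_index(toy, frac=0.5))
```

A higher Discrimination Index means an item separates strong from weak test takers more sharply, while KR-20 summarizes how consistently the whole item set measures the same ability.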

Abstract

Multiple-choice exam questions with "None of the above" (NA) options have been extensively studied in educational testing, where existing research suggests that they better assess true knowledge. However, their impact on the evaluation of Large Language Models (LLMs) remains underexplored. Through systematic experiments with 28 LLMs on the MMLU benchmark, we examine how NA options affect model performance and confidence calibration. Our analysis reveals that NA options, when used as the correct answer, lead to a consistent 30-50% performance drop across models regardless of scale, suggesting that LLMs lack the meta-cognitive ability to systematically evaluate and reject all given options when none is correct. This degradation shows strong domain dependence, with minimal impact on mathematical reasoning (14.6% drop) but severe effects on tasks requiring uncertainty handling, such as business ethics (48.1% drop). Our results highlight important implications for benchmark design and raise questions about LLMs' ability to handle uncertainty in real-world applications.

Paper Structure

This paper contains 52 sections, 2 equations, 23 figures, 7 tables.

Figures (23)

  • Figure 1: Example of an LLM (gpt-4o-2024-11-20) confused by the "None of the above" option despite knowing that both DNA and triglycerides are non-steroid molecules.
  • Figure 2: Replacing the answer "a fetus" with "None of the above" prompts LLMs to choose the next most suitable option, "an embryo", since the embryo is simply the stage preceding the fetus.
  • Figure 3: Questions in the Moral Scenarios subject mostly involve vague settings, which makes them unsuitable for the NA setting because they violate the factual verification rule.
  • Figure 4: Proportion of questions where NA is applicable across 56 MMLU subjects (excluding Moral Scenarios). STEM subjects show the highest average applicability ratio (0.731), followed by Humanities (0.570), Others (0.553), and Social Sciences (0.496). College-level subjects, particularly Chemistry and Physics, show the highest individual ratios, while subjects such as Security Studies and Moral Disputes show the lowest.
  • Figure 5: The left panel compares LLM performance on standard questions and on questions where the answer is replaced with "None of the Above". The right panel demonstrates that adding NA as an extra distractor leads to results similar to the baseline.
  • ...and 18 more figures
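
The two conditions compared in Figure 5 can be sketched as simple transformations of a multiple-choice item. This is a hedged illustration, not the authors' code: the function names are hypothetical, and placing "None of the above" as the last option in both variants is an assumption made so that the wording stays sensible. The option strings beyond "a fetus" and "an embryo" are invented for the example.

```python
NA_TEXT = "None of the above"


def na_as_answer(choices, answer_idx):
    """Drop the correct option and append NA, which becomes the keyed answer.

    Returns (new_choices, new_answer_idx). Placing NA last is an assumption.
    """
    new_choices = [c for i, c in enumerate(choices) if i != answer_idx]
    new_choices.append(NA_TEXT)
    return new_choices, len(new_choices) - 1


def na_as_distractor(choices, answer_idx):
    """Append NA as an extra, incorrect option; the original key is unchanged."""
    return list(choices) + [NA_TEXT], answer_idx


# Hypothetical item loosely modeled on the Figure 2 example.
choices = ["a fetus", "an embryo", "a zygote", "a blastocyst"]
print(na_as_answer(choices, 0))
print(na_as_distractor(choices, 0))
```

In the first variant a model must reject every listed option to be scored correct, which is the setting where the reported accuracy drop appears. In the second, the original answer is still present, so NA acts only as one more wrong choice.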