None of the Above, Less of the Right: Parallel Patterns between Humans and LLMs on Multi-Choice Questions Answering
Zhi Rui Tam, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen
TL;DR
This study investigates how None-of-the-Above (NA) options affect large language model (LLM) evaluation on multiple-choice benchmarks. By benchmarking 28 LLMs on the MMLU dataset with NA-as-Answer and NA-as-Distractor variants, the authors show a robust 30-50% accuracy drop when NA is the correct answer, with pronounced domain differences (math being relatively robust, business ethics highly affected). They provide item-quality analyses using Discrimination Index and KR-20 reliability, finding increased discrimination without sacrificing test reliability, and demonstrate that NA-robustness varies by domain and question type. The work further explores confidence and phrasing sensitivity, revealing domain-specific confidence declines and some robustness to NA wording. Importantly, targeted fine-tuning with LoRA-based methods (SFT and DPO) on NA tasks substantially improves NA handling (up to ~57.7% in NA-keyed settings) and generalizes to other benchmarks like GPQA, underscoring practical paths to mitigate NA-related weaknesses. Overall, the results imply that MCQA benchmarks designed for humans may misrepresent LLM capabilities and highlight the need for uncertainty-aware evaluation and training strategies for more reliable real-world deployment.
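The two item-quality statistics named above can be illustrated with their standard textbook definitions. The sketch below is not taken from the paper: it assumes a binary score matrix with one row per examinee (here, one row per evaluated model) and one column per question, uses population variance for the total scores, and uses the conventional upper/lower 27% groups for the Discrimination Index. The function names `kr20` and `discrimination_index` and the `fraction` parameter are hypothetical.

```python
def kr20(responses):
    """Kuder-Richardson 20 reliability for a 0/1 score matrix.

    responses: list of rows, one per examinee; each row holds 0/1 item scores.
    Assumption: population variance (divide by n) of the total scores.
    """
    n = len(responses)
    k = len(responses[0])
    if k < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    if variance == 0:
        raise ValueError("total scores have zero variance")
    # Sum of p * (1 - p), where p is the proportion answering each item correctly.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / variance)


def discrimination_index(responses, item, fraction=0.27):
    """Upper-lower Discrimination Index for one item.

    Ranks examinees by total score, then subtracts the proportion correct
    in the bottom group from the proportion correct in the top group.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    group = max(1, round(len(ranked) * fraction))
    upper = ranked[:group]
    lower = ranked[-group:]
    return (sum(r[item] for r in upper) - sum(r[item] for r in lower)) / group


# Toy data: four examinees, three items of increasing difficulty.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(scores)
d_item1 = discrimination_index(scores, item=1)
```

Under these definitions, a higher Discrimination Index means an item separates strong and weak examinees more sharply, while KR-20 summarizes the internal consistency of the whole test.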
Abstract
Multiple-choice exam questions with "None of the above" (NA) options have been extensively studied in educational testing, where existing research suggests that they better assess true knowledge. However, their impact on Large Language Model (LLM) evaluation remains underexplored. Through systematic experiments with 28 LLMs on the MMLU benchmark, we examine how NA options affect model performance and confidence calibration. Our analysis reveals that NA options, when used as the correct answer, lead to a consistent 30-50% performance drop across models regardless of scale. This suggests that LLMs lack the meta-cognitive ability to systematically evaluate and reject all given options when none are correct. The degradation shows strong domain dependence, with minimal impact on mathematical reasoning (14.6% drop) but severe effects on tasks requiring uncertainty handling, such as business ethics (48.1% drop). Our results highlight important implications for benchmark design and raise questions about LLMs' ability to handle uncertainty in real-world applications.
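To make the two settings concrete, the sketch below shows one plausible way to derive NA variants from an ordinary multiple-choice item. It is an illustration under stated assumptions, not the authors' construction procedure: the `MCQ` class, the function names, the choice to place the NA option last, and the random choice of which distractor to replace are all hypothetical.

```python
from __future__ import annotations

import random
from dataclasses import dataclass

NA_TEXT = "None of the above"


@dataclass(frozen=True)
class MCQ:
    question: str
    options: tuple[str, ...]
    answer: int  # index of the correct option


def na_as_answer(item: MCQ) -> MCQ:
    """Drop the correct option and append NA, so NA becomes the key."""
    kept = tuple(o for i, o in enumerate(item.options) if i != item.answer)
    options = kept + (NA_TEXT,)
    return MCQ(item.question, options, len(options) - 1)


def na_as_distractor(item: MCQ, rng: random.Random) -> MCQ:
    """Drop one incorrect option and append NA; the original key stays correct."""
    wrong = [i for i in range(len(item.options)) if i != item.answer]
    drop = rng.choice(wrong)
    kept = tuple(o for i, o in enumerate(item.options) if i != drop)
    # The key shifts left by one if the dropped option preceded it.
    new_answer = item.answer - (1 if drop < item.answer else 0)
    return MCQ(item.question, kept + (NA_TEXT,), new_answer)


# Toy item (not from MMLU).
original = MCQ("What is 2 + 2?", ("3", "4", "5", "6"), answer=1)
keyed = na_as_answer(original)
distractor = na_as_distractor(original, random.Random(0))
```

In the `keyed` variant every remaining concrete option is wrong, so a model must reject all of them to score; in the `distractor` variant the original answer is still present and NA is simply one more incorrect choice.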
