WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

Ahmed Elhady, Eneko Agirre, Mikel Artetxe

TL;DR

WiCkeD introduces a simple, automatic method to harden existing MCQ benchmarks by replacing one option with a wild-card 'None of the above'. Applied to six benchmarks across 18 open-weight LLMs, WiCkeD induces substantial performance drops and rearranges model rankings, signaling deeper knowledge and reasoning gaps not captured by original datasets. Chain-of-thought prompting does not fully alleviate the difficulty, with robustness improvements mainly seen in instruction-tuned or more capable models. The work provides a practical, scalable tool for robust benchmark evaluation and releases code and data for community use.

Abstract

We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We release our code and data at https://github.com/ahmedselhady/wicked-benchmarks.
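
The replacement step described in the abstract can be illustrated with a minimal sketch. This is not the released implementation: the names `MCQ`, `wicked_variant`, and `is_sba` are hypothetical, and placing the wildcard as the last option is an assumption. The sketch also assumes that single best answer (SBA) questions are already flagged and are copied unchanged, as the Figure 2 caption states.

```python
import random
from dataclasses import dataclass

NOTA = "None of the above"


@dataclass
class MCQ:
    """A multiple-choice question (hypothetical container, not from the paper's code)."""
    question: str
    choices: list
    answer: int           # index of the gold choice
    is_sba: bool = False  # assumed to be set by a separate SBA classifier


def wicked_variant(item: MCQ, rng: random.Random) -> MCQ:
    """Drop one randomly chosen option and add 'None of the above' in its place."""
    if item.is_sba:
        # SBA questions are kept verbatim to avoid an incoherent variant.
        return item

    drop = rng.randrange(len(item.choices))
    kept = [c for i, c in enumerate(item.choices) if i != drop]
    choices = kept + [NOTA]  # assumption: the wildcard goes last

    if drop == item.answer:
        # The gold option was removed, so the wildcard becomes the gold answer.
        answer = len(choices) - 1
    else:
        # The gold option survives; shift its index if an earlier option was dropped.
        answer = item.answer - (1 if drop < item.answer else 0)

    return MCQ(item.question, choices, answer, item.is_sba)
```

For example, `wicked_variant(MCQ("Q", ["a", "b", "c", "d"], 2), random.Random(0))` returns a question that still has four options, ends with "None of the above", and whose gold answer is either "c" or the wildcard, depending on which option was dropped.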

Paper Structure

This paper contains 17 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Two samples from MMLU-Pro (left) and its WiCkeD variant (right), where Hydrogen and Centrifugal were removed. Correct answers in bold. Llama-3.1 8B correctly answers both original questions but fails on the WiCkeD variant for the second question. The probability distribution of the model for each answer is also shown.
  • Figure 2: Applying WiCkeD on a single best answer (SBA) example (best answer D, second best answer A) would lead to an incoherent WiCkeD variant (incorrectly having None of the above as the gold correct answer instead of A). We thus copy SBA examples verbatim; see the section on coherence for details.
  • Figure 3: The changes in models' answers between the original benchmarks and the WiCkeD variant when using chain-of-thought.
  • Figure 4: Examples from the MMLU computer science task using WiCkeD. We show 3-shot for brevity, but 5-shot was actually used in the experiments for the main results.
  • Figure 5: Examples from the AllenAI ARC Challenge using WiCkeD. We show 3-shot for brevity, but 5-shot was actually used in the experiments for the main results. The first few-shot example does not include a None of the above option because it was classified as an SBA question.
  • ...and 2 more figures