BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

Nishant Balepur, Bhavya Rajasekaran, Jane Oh, Michael Xie, Atrey Desai, Vipul Gupta, Steven James Moore, Eunsol Choi, Rachel Rudinger, Jordan Lee Boyd-Graber

TL;DR

BenchMarker is an education-inspired toolkit that uses LLM judges to flag three MCQA flaws in NLP benchmarks: contamination, shortcuts, and writing errors. The authors validate its judgments against human annotations and then audit 12 benchmarks, showing that contamination inflates accuracy while writing flaws depress it and can shift model rankings. The study also finds that prior benchmark fixes may reduce their targeted flaws but can introduce new issues, underscoring the need for iterative, education-grounded design in MCQA benchmarks. The authors provide a public toolkit and validation datasets to guide future improvement of MCQA design and evaluation in NLP and education alike.

Abstract

Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination - items appearing exactly online; 2) shortcuts - cues in the choices that enable guessing; and 3) writing errors - structural/grammatical issues based on a 19-rule education rubric. We validate BenchMarker with human annotations, then run the tool to audit 12 benchmarks, revealing: 1) contaminated MCQs tend to inflate accuracy, while writing errors tend to lower it and shift model rankings more than chance would; and 2) prior benchmark repairs address their targeted issues (e.g., lowering accuracy with LLM-written distractors), but inadvertently add new flaws (e.g., implausible distractors, many correct answers). Overall, flaws in MCQs degrade NLP evaluation, but education research offers a path forward. We release BenchMarker to bridge the fields and improve MCQA benchmark design.
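
To make the contamination flaw concrete, the sketch below checks whether a question and its correct answer appear verbatim (after light normalization) in a page of web text. This is an illustrative assumption, not BenchMarker's implementation, which relies on LLM judges; the `MCQ` class, `normalize`, and `appears_verbatim` are hypothetical names, and the page texts are assumed to have been retrieved already.

```python
import re
from dataclasses import dataclass


@dataclass
class MCQ:
    """Hypothetical container for one multiple-choice item."""
    question: str
    choices: list[str]
    answer_index: int


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def appears_verbatim(item: MCQ, pages: list[str]) -> bool:
    """Return True if one page contains both the question and the correct answer.

    A plain substring test stands in for the LLM judge described in the abstract.
    """
    stem = normalize(item.question)
    answer = normalize(item.choices[item.answer_index])
    for page in pages:
        body = normalize(page)
        if stem in body and answer in body:
            return True
    return False


item = MCQ("What is the capital of France?", ["Berlin", "Paris", "Rome"], 1)
leaked = appears_verbatim(item, ["Quiz: What is the capital of France? Answer: Paris."])
clean = appears_verbatim(item, ["France is a country in Europe."])
```

A substring test of this kind misses paraphrased or reformatted copies, so it is only a lower bound on how often items appear online.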

Paper Structure (29 sections, 6 figures, 12 tables)

Figures (6)

  • Figure 2: Prevalence of flaws in MCQA benchmarks, grouped by whether the MCQs originate from student assessments. While MCQs from exam-based benchmarks are more commonly found online (top left), they contain far fewer writing flaws (bottom).
  • Figure 3: The five most common writing errors BenchMarker predicts in MCQA benchmarks, grouped by whether they stem from student exams. Most flaws relate to clarity and distractor difficulty. The appendix has the full distribution of the 19 flaws.
  • Figure 4: Interface from InspectAI for viewing BenchMarker runs. The overall scores are in the top right, and researchers can click on specific MCQs to view LLM judge calls and feedback, supporting debugging and analysis.
  • Figure 5: Scores for each of the 19 writing flaws across each MCQA benchmark.
  • Figure 6: Descending prevalence of all 19 writing flaws across exam-based and non-exam-based MCQA benchmarks.
  • ...and 1 more figure