Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning

Shramay Palta; Nishant Balepur; Peter Rankel; Sarah Wiegreffe; Marine Carpuat; Rachel Rudinger

TL;DR

For over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answer; manual inspection confirms that this subset exhibits higher rates of problems such as ambiguity or semantic mismatch between the question and its answer choices.

Abstract

Questions involving commonsense reasoning about everyday situations often admit many $\textit{possible}$ or $\textit{plausible}$ answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the $\textit{most}$ plausible answer choice. On $250$ MCQ items sampled from two commonsense reasoning benchmarks, we collect $5,000$ independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.
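The mismatch criterion described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data layout, the numeric rating scale, the use of the mean to aggregate judgments, and the helper names `most_plausible` and `flag_mismatches` are all hypothetical.

```python
from statistics import mean


def most_plausible(ratings):
    """Return the answer choice with the highest mean plausibility rating.

    `ratings` maps each answer choice label to a list of numeric judgments.
    Aggregating by mean is an assumption; ties go to the first choice seen.
    """
    return max(ratings, key=lambda choice: mean(ratings[choice]))


def flag_mismatches(items):
    """Return ids of items whose top-rated choice differs from the gold answer."""
    return [
        item["id"]
        for item in items
        if most_plausible(item["ratings"]) != item["gold"]
    ]


# Toy data invented for this example (not from the benchmarks or the paper).
items = [
    {"id": "q1", "gold": "A", "ratings": {"A": [5, 4, 5], "B": [2, 3, 2]}},
    {"id": "q2", "gold": "B", "ratings": {"A": [4, 5, 4], "B": [3, 3, 4]}},
]

flagged = flag_mismatches(items)
print(flagged)
```

In this toy input, `q2` is flagged because annotators rate choice `A` higher on average than the gold answer `B`. Flagged items would then be candidates for the kind of manual inspection the abstract mentions.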
