Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning

Shramay Palta, Nishant Balepur, Peter Rankel, Sarah Wiegreffe, Marine Carpuat, Rachel Rudinger

TL;DR

For over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answer; manual inspection confirms that this subset exhibits higher rates of problems such as ambiguity or semantic mismatch between question and answer choices.

Abstract

Questions involving commonsense reasoning about everyday situations often admit many $\textit{possible}$ or $\textit{plausible}$ answers. In contrast, multiple-choice question (MCQ) benchmarks for commonsense reasoning require a hard selection of a single correct answer, which, in principle, should represent the $\textit{most}$ plausible answer choice. On $250$ MCQ items sampled from two commonsense reasoning benchmarks, we collect $5,000$ independent plausibility judgments on answer choices. We find that for over 20% of the sampled MCQs, the answer choice rated most plausible does not match the benchmark gold answers; upon manual inspection, we confirm that this subset exhibits higher rates of problems like ambiguity or semantic mismatch between question and answer choices. Experiments with LLMs reveal low accuracy and high variation in performance on the subset, suggesting our plausibility criterion may be helpful in identifying more reliable benchmark items for commonsense evaluation.
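
The plausibility criterion described in the abstract can be illustrated with a minimal Python sketch. It assumes each answer choice receives several independent ratings on a 1-5 scale, that ratings are aggregated by their mean, and that an item is flagged when the top-rated choice differs from the gold label. The function name `flag_plausibly_problematic`, the data layout, and the example ratings are hypothetical, and tie handling (first maximum wins) is an assumption rather than the authors' procedure.

```python
from statistics import mean


def flag_plausibly_problematic(ratings, gold):
    """Flag an MCQ whose top-rated choice is not the gold label.

    ratings: dict mapping each answer choice to its list of 1-5 ratings
    gold: key of the benchmark gold answer
    (Hypothetical helper; mean aggregation is an assumption.)
    """
    # Mean plausibility rating per answer choice.
    means = {choice: mean(r) for choice, r in ratings.items()}
    # Choice with the highest mean; ties resolve to the first maximum.
    top_choice = max(means, key=means.get)
    # Highest mean among the non-gold choices.
    best_non_gold = max(v for c, v in means.items() if c != gold)
    return {
        "means": means,
        "top_choice": top_choice,
        # Negative margin: some non-gold choice was rated above the gold label.
        "gold_margin": means[gold] - best_non_gold,
        "flagged": top_choice != gold,
    }


# Made-up ratings from 5 annotators for a 3-choice item.
example = {
    "A": [4, 5, 4, 3, 4],
    "B": [2, 3, 2, 3, 2],
    "C": [5, 5, 4, 5, 4],
}
result = flag_plausibly_problematic(example, gold="A")
print(result["top_choice"], result["flagged"])
```

In this made-up item, choice C has a higher mean rating than the gold choice A, so the item would be flagged for manual inspection.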

Paper Structure

This paper contains 20 sections, 17 figures, 9 tables.

Figures (17)

  • Figure 1: An example question from Social IQa where the highest plausibility answer choice is not the gold label. The numbers indicate the plausibility ratings given by $5$ human annotators to each option on a 1-5 scale and the gold label is highlighted in bold. Numbers in parentheses represent the mean plausibility rating for that answer choice. The answer choice with the highest plausibility rating is underlined.
  • Figure 2: Difference in the plausibility scores between the top 2 most plausible choices (\ref{plausibility_ratings}) vs. percentage of votes (\ref{full_question_annotations}) received by the top choice (on SIQA (left) and CSQA (right)). The size of the point represents the number of data points at an instance.
  • Figure 3: Frequency of issue types on the "plausibly problematic" (solid) and non-problematic (hatched) questions from SIQA (left) and CSQA (right) ($28$ MCQs each). These labels are not mutually exclusive: a question can be "plausibly problematic" for multiple reasons and hence be tagged with more than one label.
  • Figure 4: Histograms showing the difference between the mean gold label rating and best non-gold label rating. Portions of the graph in red with texture show cases where the best non-gold option had a higher mean plausibility rating than the mean gold label rating.
  • Figure 5: An example of the interface that annotators used while giving plausibility ratings to answer choices as described in \ref{plausibility_ratings}.
  • ...and 12 more figures