Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores

Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt

TL;DR

PlausibleQA introduces a large-scale QA dataset that explicitly annotates candidate answers with plausibility scores and justifications, addressing the gap where plausible but incorrect answers are undervalued in QA evaluation. The dataset comprises 10,000 questions, 100,000 candidate answers, and up to 1,000,000 total justifications, including 900,000 pairwise comparison justifications, generated through a three-module pipeline (Question Sampling, Candidate Answer Generation, Preparation) and processed with both listwise and pairwise scoring frameworks. Through extensive experiments in MCQA distractor generation and QA robustness assessment (QARA), the authors demonstrate that plausibility-aware evaluation yields more informative distractors and reveals robustness gaps in current LLMs, particularly on harder questions. PlausibleQA thus provides a practical resource to improve QA evaluation, drive targeted model improvements, and enable adaptive, plausibility-informed testing across QA-related tasks.

Abstract

Large Language Models (LLMs) are revolutionizing information retrieval, with chatbots becoming an important source for answering user queries. Because LLMs are designed to prioritize generating correct answers, the value of highly plausible yet incorrect answers (candidate answers) tends to be overlooked. However, such answers can still prove useful; for example, they play a crucial role in tasks such as Multiple-Choice Question Answering (MCQA) and QA Robustness Assessment (QARA). Existing QA datasets primarily focus on correct answers without explicitly considering the plausibility of other candidate answers, which limits the opportunity for more nuanced evaluation of models. To address this gap, we introduce PlausibleQA, a large-scale dataset comprising 10,000 questions and 100,000 candidate answers, each annotated with a plausibility score and a justification for its selection. Additionally, the dataset includes 900,000 justifications for pairwise comparisons between candidate answers, further refining the plausibility assessments. We evaluate PlausibleQA through human assessments and empirical experiments, demonstrating its utility in MCQA and QARA analysis. Our findings show that plausibility-aware approaches are effective for MCQA distractor generation and QARA. We release PlausibleQA as a resource for advancing QA research and enhancing LLM performance in distinguishing plausible distractors from correct answers.
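
The sizes quoted in the abstract are mutually consistent if each question has 10 candidate answers and every ordered pair of distinct candidates is compared once. That ordered-pair reading is an assumption inferred from the totals; the abstract itself does not state it. A quick arithmetic check under that assumption:

```python
# Sanity check of the reported dataset sizes.
# Assumption (not stated in the abstract): pairwise comparisons cover
# every ordered pair of distinct candidates for a question.
questions = 10_000
candidates_per_question = 10  # 100,000 candidates / 10,000 questions

# One justification per candidate answer.
candidate_justifications = questions * candidates_per_question

# 10 * 9 = 90 ordered pairs per question.
ordered_pairs_per_question = candidates_per_question * (candidates_per_question - 1)
pairwise_justifications = questions * ordered_pairs_per_question

total_justifications = candidate_justifications + pairwise_justifications

print(candidate_justifications, pairwise_justifications, total_justifications)
```

Under this reading, the 100,000 per-candidate justifications plus the 900,000 pairwise justifications account for the figure of up to 1,000,000 justifications given in the TL;DR.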

Paper Structure

This paper contains 25 sections, 3 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Example from the PlausibleQA dataset: The Question and Answer sections present the question along with its gold answer. The Plausibility Score section ranks candidate answers in descending order based on their plausibility scores. Justification Boxes provide the justifications for the selection of each candidate answer.
  • Figure 2: The pipeline of PlausibleQA generation: ① Datasets are passed to the Question Sampling. ② The type of each question is detected. ③ Questions are filtered accordingly. ④ 10,000 questions are sampled from the filtered questions. ⑤ Using LLMs, 10 candidate answers are generated for each question. ⑥ If the output does not pass the filtering stage, it is retried until ⑦ 10 unique candidate answers are generated. ⑧ The candidate answers are then passed to the Preparation stage. ⑨ Before this, the candidate answers undergo Pairs Creation. ⑩ The candidate answers are converted into pairwise items ⑪ and passed to an LLM for pairwise comparison. ⑫ The results are passed to the Preparation stage, where they are converted from pairwise order to listwise order. ⑬ The question difficulty and answer difficulty are then evaluated. Finally, ⑭ all generated attributes are stored in the PlausibleQA dataset. The numbers in the arrows indicate the number of output items. Additionally, Qs represents Questions, and Cs represents Candidates.
  • Figure 3: Prompt for Candidate Answer Generation. <question> represents the given question, while <ground_truth> denotes its correct answer. Each candidate answer is represented by <candidate_answer>, with an initial plausibility score (<plausibility_score>) and a justification (<justification>) explaining both the answer choice and the assigned plausibility score.
  • Figure 4: Prompt for Pairwise Candidate Answer Comparison. <question> represents the given question, while <ground_truth> denotes its correct answer. The candidate answers $x_1$ and $x_2$ are represented by <ca_1> and <ca_2>, respectively.
  • Figure 5: The distribution of the PlausibleQA dataset across the TriviaQA, NQ, and WebQ datasets.
  • ...and 7 more figures
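
The Figure 2 caption mentions a Pairs Creation step and a conversion from pairwise order to listwise order, but this listing does not say how the conversion is done. The following is a minimal sketch using a simple win-count aggregation, which is a stand-in chosen for illustration and not necessarily the authors' method. The names `create_pairs`, `pairwise_to_listwise`, and `toy_prefer` are hypothetical, and `toy_prefer` replaces the LLM comparison with fixed toy scores.

```python
from collections import Counter
from itertools import permutations


def create_pairs(candidates):
    """Return all ordered pairs of distinct candidates (90 for 10 candidates)."""
    return list(permutations(candidates, 2))


def pairwise_to_listwise(candidates, prefer):
    """Rank candidates by their number of pairwise wins.

    `prefer(a, b)` must return whichever of a or b is judged more plausible.
    Win counting is an illustrative aggregation, not the paper's confirmed one.
    """
    wins = Counter({c: 0 for c in candidates})
    for first, second in create_pairs(candidates):
        wins[prefer(first, second)] += 1
    # sorted() is stable, so ties keep their original relative order.
    return sorted(candidates, key=lambda c: -wins[c])


# Hypothetical stand-in for the LLM judgment: compare fixed toy scores.
toy_scores = {"cand_b": 0.5, "cand_a": 0.9, "cand_c": 0.1}


def toy_prefer(a, b):
    return a if toy_scores[a] >= toy_scores[b] else b


ranking = pairwise_to_listwise(list(toy_scores), toy_prefer)
print(ranking)
```

Comparing both orderings of each pair, as done here, is one way to reduce sensitivity to the position in which a candidate appears in the prompt; whether the dataset was built for that reason is not stated in this listing.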