Wrong Answers Can Also Be Useful: PlausibleQA -- A Large-Scale QA Dataset with Answer Plausibility Scores
Jamshid Mozafari, Abdelrahman Abdallah, Bhawna Piryani, Adam Jatowt
TL;DR
PlausibleQA introduces a large-scale QA dataset that explicitly annotates candidate answers with plausibility scores and justifications, addressing the gap where plausible but incorrect answers are undervalued in QA evaluation. The dataset comprises 10,000 questions, 100,000 candidate answers, and up to 1,000,000 total justifications, including 900,000 pairwise comparison justifications, generated through a three-module pipeline (Question Sampling, Candidate Answer Generation, Preparation) and processed with both listwise and pairwise scoring frameworks. Through extensive experiments in MCQA distractor generation and QA robustness assessment (QARA), the authors demonstrate that plausibility-aware evaluation yields more informative distractors and reveals robustness gaps in current LLMs, particularly on harder questions. PlausibleQA thus provides a practical resource to improve QA evaluation, drive targeted model improvements, and enable adaptive, plausibility-informed testing across QA-related tasks.
Abstract
Large Language Models (LLMs) are revolutionizing information retrieval, with chatbots becoming an important source for answering user queries. Because LLMs are designed to prioritize generating correct answers, the value of highly plausible yet incorrect answers (candidate answers) tends to be overlooked. However, such answers can still prove useful; for example, they play a crucial role in tasks like Multiple-Choice Question Answering (MCQA) and QA Robustness Assessment (QARA). Existing QA datasets focus primarily on correct answers without explicitly considering the plausibility of other candidate answers, which limits opportunities for more nuanced evaluation of models. To address this gap, we introduce PlausibleQA, a large-scale dataset comprising 10,000 questions and 100,000 candidate answers, each annotated with a plausibility score and a justification for its selection. Additionally, the dataset includes 900,000 justifications for pairwise comparisons between candidate answers, further refining plausibility assessments. We evaluate PlausibleQA through human assessments and empirical experiments, demonstrating its utility in MCQA and QARA analysis. Our findings show that plausibility-aware approaches are effective for MCQA distractor generation and QARA. We release PlausibleQA as a resource for advancing QA research and enhancing LLM performance in distinguishing plausible distractors from correct answers.
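As a quick consistency check on the reported dataset sizes, the arithmetic can be sketched as follows. This is only an illustration under two assumptions that the text above does not state explicitly: that candidate answers are split evenly across questions, and that pairwise comparisons are counted as ordered pairs of distinct candidates. All variable names are hypothetical.

```python
from itertools import permutations

# Counts reported in the abstract and TL;DR.
NUM_QUESTIONS = 10_000
NUM_CANDIDATES = 100_000
NUM_PAIRWISE_JUSTIFICATIONS = 900_000

# Assumption: candidates are distributed evenly across questions.
candidates_per_question = NUM_CANDIDATES // NUM_QUESTIONS

# Pairwise justifications per question implied by the reported totals.
pairwise_per_question = NUM_PAIRWISE_JUSTIFICATIONS // NUM_QUESTIONS

# Assumption: each comparison is an ordered pair (a, b) with a != b,
# which gives k * (k - 1) pairs for k candidates.
ordered_pairs = len(list(permutations(range(candidates_per_question), 2)))

# One justification per candidate plus the pairwise justifications.
total_justifications = NUM_CANDIDATES + NUM_PAIRWISE_JUSTIFICATIONS

print(candidates_per_question, pairwise_per_question, ordered_pairs)
print(total_justifications)
```

Under these assumptions, the per-question pairwise count matches the number of ordered pairs among the candidates, and the sum of per-candidate and pairwise justifications matches the "up to 1,000,000" figure in the TL;DR.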
