Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

Yunchong Huang; Gianni Barlacchi; Sandro Pezzelle

TL;DR

This work shows that a substantial fraction of questions in popular QA benchmarks are underspecified (UND): their intended interpretation cannot be determined without extra context. It introduces an LLM-based classifier to detect underspecification and an LLM-based rewriter that converts UND questions into fully specified (FS) variants while preserving the gold answer. Across multiple datasets and two QA models, performance on UND questions is consistently worse than on FS questions. Rewriting UND questions into FS form substantially narrows this gap and in many cases yields near-parity with FS questions, which suggests that the bottleneck lies in question formulation rather than model capability. The study identifies underspecification as a critical factor in QA evaluation, provides reproducible tools for detection and rewriting, and argues for benchmark designs that prioritize question clarity so that model performance and progress can be assessed reliably.

Abstract

Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions: queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
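
The four-step procedure described above (detect, evaluate, rewrite, re-evaluate) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `QAItem`, `accuracy`, and `run_pipeline` are hypothetical, the classifier, rewriter, and QA model are passed in as plain callables rather than LLM calls, and exact-match scoring is an assumption, since the metric is not stated here. The toy stand-ins at the bottom are lookup tables that exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QAItem:
    question: str
    gold_answer: str


def accuracy(items: List[QAItem], answer_fn: Callable[[str], str]) -> float:
    """Case-insensitive exact match (an assumed metric, not the paper's)."""
    if not items:
        return 0.0
    correct = sum(
        answer_fn(it.question).strip().lower() == it.gold_answer.strip().lower()
        for it in items
    )
    return correct / len(items)


def run_pipeline(
    items: List[QAItem],
    is_underspecified: Callable[[str], bool],  # Step 1: UND detector
    rewrite: Callable[[str], str],             # Step 3: UND -> FS rewriter
    answer_fn: Callable[[str], str],           # QA model under evaluation
) -> Dict[str, float]:
    # Step 1: split the benchmark into FS and UND questions.
    und = [it for it in items if is_underspecified(it.question)]
    fs = [it for it in items if not is_underspecified(it.question)]
    # Step 3: rewrite UND questions while holding gold answers fixed.
    rewritten = [QAItem(rewrite(it.question), it.gold_answer) for it in und]
    return {
        "und_share": len(und) / len(items) if items else 0.0,
        "acc_fs": accuracy(fs, answer_fn),                # Step 2
        "acc_und": accuracy(und, answer_fn),              # Step 2
        "acc_und_rewritten": accuracy(rewritten, answer_fn),  # Step 4
    }


# Toy stand-ins (hypothetical; real components would be LLM calls).
TOY_ITEMS = [
    QAItem("What is the capital of France?", "Paris"),
    QAItem("What is the capital?", "Paris"),
]
TOY_REWRITES = {"What is the capital?": "What is the capital of France?"}
TOY_ANSWERS = {"What is the capital of France?": "Paris"}


def toy_is_underspecified(question: str) -> bool:
    return question in TOY_REWRITES


def toy_rewrite(question: str) -> str:
    return TOY_REWRITES.get(question, question)


def toy_answer(question: str) -> str:
    return TOY_ANSWERS.get(question, "unknown")
```

Calling `run_pipeline(TOY_ITEMS, toy_is_underspecified, toy_rewrite, toy_answer)` returns the share of UND questions together with accuracy on the FS subset, the UND subset, and the rewritten UND subset. Comparing the last two numbers mirrors the controlled comparison in the abstract, because only the question text changes between them.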

Paper Structure (37 sections, 10 figures, 5 tables)

Introduction
Approach
  Step 1: Detecting underspecified questions
  Step 2: Assessing LLM performance on QA
  Step 3: Rewriting UND questions
  Step 4: Reassessing LLM's QA performance
Experiments
  Step 1: Detecting underspecified questions
    Data
    Models
    Experimental setup
    Results
  Step 2: Assessing LLM QA performance
    Data
    Models
...and 22 more sections

Figures (10)

Figure 1: Top: A real question from the Natural Questions (NQ) dataset (Kwiatkowski et al., 2019) and the corresponding wrong answer from a QA model, GPT-4o (OpenAI, 2024). Middle: We build an LLM-based classifier to detect underspecified (UND) questions in QA benchmarks. Bottom: We turn UND questions into fully specified (FS) ones using an LLM-based rewriter and verify that removing underspecification significantly improves performance on the QA task.
Figure 2: Proportion of FS/UND questions in each of the five source datasets included in QA-ensemble.
Figure 3: (Step 2) The FS/UND classification results of the NQ sample in QA-ensemble.
Figure 4: (Step 2) The FS/UND classification results of the HotpotQA sample in QA-ensemble.
Figure 5: (Step 2) The FS/UND classification results of the TriviaQA sample in QA-ensemble.
...and 5 more figures
