Syn-QA2: Evaluating False Assumptions in Long-tail Questions with Synthetic QA Datasets

Ashwin Daswani; Rohan Sawant; Najoung Kim

TL;DR

Syn-(QA)$^2$ addresses the difficulty that QA systems have with false assumptions in long-tail information-seeking questions. It provides two synthetic datasets generated by entity perturbation: a single-hop dataset built from Wikidata relations and a multi-hop dataset built from HotpotQA. The authors compare multiple large language models on a binary false-assumption detection task and on generative QA, revealing persistent challenges, model response biases, and the benefits of search-engine augmentation. Because each question with a false assumption is paired with an unperturbed counterpart, the datasets support controlled minimal-pair comparisons that isolate the effect of the false premise in both single-hop and multi-hop settings. The work highlights the need for robust evaluation protocols for false premises in open-domain QA and offers a practical resource for diagnosing and improving future systems.

Abstract

Sensitivity to false assumptions (or false premises) in information-seeking questions is critical for robust question-answering (QA) systems. Recent work has shown that false assumptions in naturally occurring questions pose challenges to current models, with low performance on both generative QA and simple detection tasks (Kim et al. 2023). However, the focus of existing work on naturally occurring questions leads to a gap in the analysis of model behavior on the long tail of the distribution of possible questions. To this end, we introduce Syn-(QA)$^2$, a set of two synthetically generated QA datasets: one generated using perturbed relations from Wikidata, and the other by perturbing HotpotQA (Yang et al. 2018). Our findings from evaluating a range of large language models are threefold: (1) false assumptions in QA are challenging, echoing the findings of prior work, (2) the binary detection task is challenging even compared to the difficulty of generative QA itself, possibly due to the linguistic structure of the problem, and (3) the detection task is more challenging with long-tail questions compared to naturally occurring questions, highlighting the utility of our synthetic datasets and generation method.
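To make the minimal-pair idea concrete, the following is a minimal sketch of how a question with a false assumption could be derived from a relation triple by swapping one entity, and how accuracy on the binary detection task could be scored. It is not the authors' pipeline: the `Triple` class, the question template, the `make_minimal_pair` and `detection_accuracy` helpers, and the toy facts are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A (subject, relation, object) fact, in the style of a Wikidata relation."""
    subject: str
    relation: str
    obj: str


# Hypothetical template: the question presupposes that the triple holds.
TEMPLATES = {
    "award_received": "In which year did {subject} receive the {obj}?",
}


def make_minimal_pair(triple, candidates, known_facts, rng):
    """Return (valid_question, false_assumption_question).

    The second question differs from the first only in the object entity,
    which is replaced by one that does not form a known true fact.
    """
    template = TEMPLATES[triple.relation]
    # Exclude replacements that would accidentally yield a true premise.
    pool = sorted(
        c for c in candidates
        if Triple(triple.subject, triple.relation, c) not in known_facts
    )
    if not pool:
        raise ValueError("no valid perturbation candidate")
    perturbed = rng.choice(pool)
    valid = template.format(subject=triple.subject, obj=triple.obj)
    false = template.format(subject=triple.subject, obj=perturbed)
    return valid, false


def detection_accuracy(gold, predicted):
    """Fraction of questions whose has-false-assumption label is predicted correctly."""
    if len(gold) != len(predicted):
        raise ValueError("label lists must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Toy facts (assumed for illustration only).
facts = {
    Triple("Marie Curie", "award_received", "Nobel Prize in Physics"),
    Triple("Marie Curie", "award_received", "Nobel Prize in Chemistry"),
}
awards = {"Nobel Prize in Physics", "Nobel Prize in Chemistry", "Fields Medal"}

valid_q, false_q = make_minimal_pair(
    Triple("Marie Curie", "award_received", "Nobel Prize in Physics"),
    candidates=awards,
    known_facts=facts,
    rng=random.Random(0),
)
```

Filtering candidates against the known facts is the key design choice in this sketch: without it, a swapped entity could produce a second true premise rather than a false one, and the pair would no longer isolate the effect of the false assumption.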

Paper Structure (19 sections, 4 figures, 4 tables)

Motivation
Dataset
Generating single-hop questions with false assumptions from Wikidata relations
Generating multi-hop questions with false assumptions from HotpotQA
Experiments
Evaluation metrics
Models and Prompting
Results
Patterns of response bias
Effect of search-engine augmentation
Manual generative QA evaluation
Discussion
Are synthetic, long-tail false assumptions more difficult to detect than naturally occurring ones?
Difficulty of generative QA vs. False assumption detection
Conclusion
...and 4 more sections

Figures (4)

Figure 1: Example of a minimal pair of questions without and with false assumptions in our single-hop and multi-hop datasets. The perturbed entity that is responsible for the false assumption is colored red and the pre-perturbation entity is colored teal.
Figure 2: Visualization of the single-hop question generation process.
Figure 3: Accuracy for the false assumption detection task on our datasets. Since few-shot and few-shot CoT did not show substantially different trends, we only show the few-shot results here (see Appendix \ref{app:full-results} for full results). Zero-shot Flan-T5-XXL did not yield correct class labels in most cases.
Figure 4: Results of the generative QA evaluation (left) and results comparing the difficulty of false assumption detection for synthetic and naturally occurring questions (right).
