Answerability in Retrieval-Augmented Open-Domain Question Answering

Rustam Abdumalikov, Pasquale Minervini, Yova Kementchedjhieva

TL;DR

This paper addresses answerability in open-domain QA, showing that models trained with randomly sampled negative excerpts generalize poorly to excerpts that are semantically related to the question but still irrelevant. It proposes augmenting training with unanswerable examples from SQuAD 2.0 and uses ChatGPT-generated excerpts to simulate realistic, challenging contexts. Using T5-Large and T5-XL, the study demonstrates that models trained with random negatives abstain reliably on easy cases but fail on excerpts with overlapping content, while SQuAD 2.0-based training yields near-perfect abstention even on difficult excerpts. The findings highlight the importance of teaching ODQA systems to abstain when evidence is insufficient, which makes QA systems more trustworthy and robust in real-world retrieval settings.

Abstract

Retrieval systems in Open-Domain Question Answering (ODQA) can behave sub-optimally, returning text excerpts with varying degrees of irrelevance. Unfortunately, many existing ODQA datasets lack examples that specifically target the identification of irrelevant text excerpts. Previous attempts to address this gap have relied on the simplistic approach of pairing questions with random text excerpts. This paper investigates the effectiveness of models trained with this randomized strategy and uncovers an important limitation in their ability to generalize to irrelevant text excerpts with high semantic overlap: on such excerpts, we observe a substantial decrease in predictive accuracy, from 98% to 1%. To address this limitation, we identify an efficient approach for training models to recognize such excerpts. By leveraging unanswerable pairs from the SQuAD 2.0 dataset, our models achieve nearly perfect (~100%) accuracy when confronted with these challenging text excerpts.
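
To make the randomized strategy concrete, the following is a minimal sketch of how questions might be paired with random excerpts to build abstention examples. It is an illustration only, not the paper's implementation: the field names (`question`, `context`, `answer`), the `ABSTAIN` target string, and the helper `make_random_negatives` are all hypothetical.

```python
import random

# Hypothetical abstention target; the label actually used in the paper is not given here.
ABSTAIN = "unanswerable"


def make_random_negatives(examples, seed=0):
    """Pair each question with the excerpt of a different, randomly chosen example.

    `examples` is assumed to be a list of dicts with "question" and "context" keys.
    """
    if len(examples) < 2:
        raise ValueError("need at least two examples to sample a mismatched excerpt")
    rng = random.Random(seed)
    negatives = []
    for i, ex in enumerate(examples):
        # Draw an index uniformly from all positions except i.
        j = rng.randrange(len(examples) - 1)
        if j >= i:
            j += 1
        negatives.append(
            {
                "question": ex["question"],
                "context": examples[j]["context"],
                "answer": ABSTAIN,
            }
        )
    return negatives


# Toy data, invented for illustration.
examples = [
    {"question": "Who wrote Hamlet?", "context": "Hamlet is a tragedy by William Shakespeare."},
    {"question": "What is the capital of France?", "context": "Paris is the capital of France."},
    {"question": "When did Apollo 11 land?", "context": "Apollo 11 landed on the Moon in 1969."},
]
negatives = make_random_negatives(examples)
```

Negatives built this way rarely share a topic with the question, which is consistent with the abstract's observation that models trained on them struggle with irrelevant excerpts of high semantic overlap. The SQuAD 2.0 alternative would instead take that dataset's existing unanswerable question and excerpt pairs as abstention examples.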

Paper Structure (12 sections, 5 tables)