Evidentiality-aware Retrieval for Overcoming Abstractiveness in Open-Domain Question Answering
Yongho Song, Dahyun Lee, Myungha Jang, Seung-won Hwang, Kyungjae Lee, Dongha Lee, Jinyeong Yeo
TL;DR
The paper addresses the misalignment between relevance signals and answerability in abstractive open-domain QA by proposing Evidentiality-Aware Dense Passage Retrieval (EADPR). EADPR augments the training data with synthetic distractors, built by removing evidence spans, and uses pseudo-evidence to create hard negatives and pseudo-positives, so that the retriever learns evidentiality-aware representations. The method extends the DPR objective $\mathcal{L}_{dpr}$ with a hard-negative term $\mathcal{L}_{HN}$ and a pseudo-positive term $\mathcal{L}_{PP}$, giving $\mathcal{L}_{eadpr} = \mathcal{L}_{dpr} + \tau_1 \mathcal{L}_{HN} + \tau_2 \mathcal{L}_{PP}$, where $\tau_1$ and $\tau_2$ weight the two added terms. Experiments on Natural Questions, TriviaQA, TREC, and HotpotQA show consistent improvements in retrieval metrics and end-to-end QA, along with greater robustness to distractors and better label efficiency than standard dense retrievers.
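To make the combined objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes $\mathcal{L}_{dpr}$ is the usual DPR negative log-likelihood of the positive passage under a softmax over similarity scores. The exact forms of $\mathcal{L}_{HN}$ and $\mathcal{L}_{PP}$ are not given in this summary, so they are passed in as precomputed scalars. The function names `dpr_nll` and `eadpr_loss` and the default weights are hypothetical.

```python
import math

def dpr_nll(scores, pos_index=0):
    """Negative log-likelihood of the positive passage under a softmax
    over query-passage similarity scores (the standard DPR loss form)."""
    m = max(scores)
    # Numerically stable log-sum-exp over all candidate passages.
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[pos_index]

def eadpr_loss(l_dpr, l_hn, l_pp, tau1=1.0, tau2=1.0):
    """Weighted sum L_dpr + tau1 * L_HN + tau2 * L_PP.
    l_hn and l_pp are assumed to be computed elsewhere; tau1 and tau2
    defaults are placeholders, not values reported by the paper."""
    return l_dpr + tau1 * l_hn + tau2 * l_pp

# Toy example: one positive (index 0) and two negatives.
l_dpr = dpr_nll([2.0, 0.0, 0.0])
total = eadpr_loss(l_dpr, l_hn=0.5, l_pp=0.2, tau1=0.5, tau2=0.1)
```

Setting `tau1 = tau2 = 0` reduces the sketch to the plain DPR objective, which is a quick way to check a training loop before enabling the additional terms.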
Abstract
The long-standing goal of dense retrievers in abstractive open-domain question answering (ODQA) tasks is to learn to capture evidence passages among relevant passages for any given query, such that the reader produces factually correct outputs from evidence passages. One of the key challenges is the insufficient amount of training data with supervision on the answerability of the passages. Recent studies rely on iterative pipelines to annotate answerability using signals from the reader, but their high computational costs hamper practical applications. In this paper, we instead focus on a data-centric approach and propose Evidentiality-Aware Dense Passage Retrieval (EADPR), which leverages synthetic distractor samples to learn to discriminate evidence passages from distractors. We conduct extensive experiments to validate the effectiveness of our proposed method on multiple abstractive ODQA tasks.
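As a rough illustration of the synthetic distractor idea, the sketch below removes a known evidence span from a passage, leaving text that is still topically relevant but no longer supports the answer. It assumes the evidence span is already available as an exact substring of the passage. How the paper actually locates spans is not described here, and the helper name `make_distractor` is hypothetical.

```python
def make_distractor(passage: str, evidence_span: str) -> str:
    """Return the passage with the first occurrence of evidence_span removed.
    Assumes the span appears verbatim in the passage."""
    idx = passage.find(evidence_span)
    if idx < 0:
        raise ValueError("evidence span not found in passage")
    stripped = passage[:idx] + passage[idx + len(evidence_span):]
    # Collapse the whitespace left behind by the removal.
    return " ".join(stripped.split())

passage = "Paris is the capital of France. It lies on the Seine."
distractor = make_distractor(passage, "Paris is the capital of France.")
```

The resulting distractor keeps the surrounding context, so a retriever trained to rank the original passage above it has to attend to the evidence itself and not only to topical overlap.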
