Towards Reliable and Factual Response Generation: Detecting Unanswerable Questions in Information-Seeking Conversations
Weronika Łajewska, Krisztian Balog
TL;DR
This paper tackles unanswerable questions in information-seeking conversations. It introduces an answerability detector that runs after passage retrieval and before response generation: it determines whether the question can be answered from the corpus, so that the system can signal unanswerability rather than hallucinate. Within a two-step conversational information seeking (CIS) pipeline, a sentence-level BERT classifier makes predictions that are aggregated first within each passage and then across the top-$n$ ranked passages ($n=3$) to produce a final answerability estimate. The paper also introduces the CAsT-answerability dataset, with labels at the sentence, passage, and ranking levels. The authors show that this simple aggregation-based baseline, especially with max aggregation at the passage level and mean aggregation at the ranking level, can outperform a state-of-the-art LLM on answerability prediction, and that augmentation with SQuAD 2.0 helps at the lower (sentence and passage) levels but not at the ranking level. Overall, the work aims to improve transparency and reliability in retrieval-grounded dialogue systems, and the dataset and code are publicly available for further research.
Abstract
Generative AI models face the challenge of hallucinations that can undermine users' trust in such systems. We approach the problem of conversational information seeking as a two-step process, where relevant passages in a corpus are identified first and then summarized into a final system response. This way we can automatically assess if the answer to the user's question is present in the corpus. Specifically, our proposed method employs a sentence-level classifier to detect if the answer is present, then aggregates these predictions on the passage level, and eventually across the top-ranked passages to arrive at a final answerability estimate. For training and evaluation, we develop a dataset based on the TREC CAsT benchmark that includes answerability labels on the sentence, passage, and ranking levels. We demonstrate that our proposed method represents a strong baseline and outperforms a state-of-the-art LLM on the answerability prediction task.
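The aggregation step described above can be illustrated with a minimal sketch. It assumes that a sentence-level classifier has already produced one answerability probability per sentence; the function names, the 0.5 decision threshold, and the example scores are hypothetical and are not taken from the authors' released code. The defaults (max within a passage, mean across the top 3 passages) follow the configuration that the summary reports as performing best.

```python
from statistics import mean
from typing import Callable, Sequence

# An aggregator maps a non-empty sequence of scores to a single score.
Aggregator = Callable[[Sequence[float]], float]


def passage_score(sentence_probs: Sequence[float], agg: Aggregator = max) -> float:
    """Aggregate sentence-level probabilities into one passage-level score."""
    return agg(sentence_probs)


def ranking_score(
    passages: Sequence[Sequence[float]],
    passage_agg: Aggregator = max,
    ranking_agg: Aggregator = mean,
    top_n: int = 3,
) -> float:
    """Aggregate passage-level scores over the top-n ranked passages.

    `passages` is ordered by retrieval rank; each entry holds the
    sentence-level probabilities for that passage (assumed non-empty).
    """
    top = passages[:top_n]
    if not top:
        # Assumption: with no retrieved passages, treat the question as unanswerable.
        return 0.0
    return ranking_agg([passage_score(p, passage_agg) for p in top])


def is_answerable(passages: Sequence[Sequence[float]], threshold: float = 0.5) -> bool:
    """Binary decision; the 0.5 threshold is an illustrative assumption."""
    return ranking_score(passages) >= threshold


# Hypothetical classifier outputs for four retrieved passages.
example = [[0.1, 0.9, 0.2], [0.3, 0.4], [0.05, 0.1], [0.99]]
score = ranking_score(example)  # uses only the first three passages
```

In this example the passage-level scores are 0.9, 0.4, and 0.1, so the ranking-level mean falls just below the assumed threshold and the question would be flagged as unanswerable. Swapping `passage_agg` or `ranking_agg` for another function (for instance `mean` or `max`) gives the other aggregation variants.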
