Selectively Answering Ambiguous Questions
Jeremy R. Cole, Michael J. Q. Zhang, Daniel Gillick, Julian Martin Eisenschlos, Bhuwan Dhingra, Jacob Eisenstein
TL;DR
This work tackles selective QA under denotational and epistemic uncertainty, focusing on ambiguous questions where meaning and facts may both be unclear. It proposes a disambiguate-then-answer framework and demonstrates that sampling-based confidence signals (Sampling Repetition, Sampling Diversity) yield better calibration than traditional Likelihood or Self-Verification approaches, especially for ambiguous queries. The authors provide extensive evaluation on unambiguous datasets (Natural Questions, TriviaQA) and ambiguous datasets (AmbigQA, SituatedQA), showing improved C@80 and overall calibration when using sampling-based scores, including with instruction-tuned models. The results support deploying sampling-based calibration to enable abstention and safer answer generation in real-world QA systems, while acknowledging trade-offs in compute and limitations of the single-model scope and closed-book setting.
Abstract
Trustworthy language models should abstain from answering questions when they do not know the answer. However, the answer to a question can be unknown for a variety of reasons. Prior research has focused on the case in which the question is clear and the answer is unambiguous but possibly unknown, but the answer to a question can also be unclear due to uncertainty about the questioner's intent or context. We investigate question answering from this perspective, focusing on answering a subset of questions with a high degree of accuracy, from a set of questions in which many are inherently ambiguous. In this setting, we find that the most reliable approach to decide when to abstain involves quantifying repetition within sampled model outputs, rather than the model's likelihood or self-verification as used in prior work. We find this to be the case across different types of uncertainty and model scales, and with or without instruction tuning. Our results suggest that sampling-based confidence scores help calibrate answers to relatively unambiguous questions, with more dramatic improvements on ambiguous questions.
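To make the idea of "quantifying repetition within sampled model outputs" concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that a reference answer (for example, the greedy decode) and a list of sampled answers have already been obtained from some model. The function names, the string normalization, and the abstention threshold are all hypothetical choices made for illustration; the abstract does not specify them.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match.
    (Assumed normalization; the source does not specify one.)"""
    return " ".join(answer.lower().split())


def sampling_repetition(reference: str, samples: list[str]) -> float:
    """Fraction of sampled answers that match the reference answer."""
    if not samples:
        return 0.0
    ref = normalize(reference)
    return sum(normalize(s) == ref for s in samples) / len(samples)


def sampling_diversity(samples: list[str]) -> float:
    """One minus the ratio of unique answers to total samples.
    Higher values mean the samples agree with each other more."""
    if not samples:
        return 0.0
    unique = len({normalize(s) for s in samples})
    return 1.0 - unique / len(samples)


def answer_or_abstain(reference: str, samples: list[str], threshold: float = 0.5):
    """Return the reference answer if the repetition score clears the
    (hypothetical) threshold; otherwise return None to signal abstention."""
    if sampling_repetition(reference, samples) >= threshold:
        return reference
    return None
```

For instance, with the samples `["Paris", "paris", "Lyon", "Paris "]` and the reference `"Paris"`, three of four samples match the reference, so a threshold of 0.5 would answer while a threshold of 0.8 would abstain. In practice the threshold would be tuned on held-out data to reach a target accuracy on the answered subset.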
