Selectively Answering Ambiguous Questions

Jeremy R. Cole, Michael J. Q. Zhang, Daniel Gillick, Julian Martin Eisenschlos, Bhuwan Dhingra, Jacob Eisenstein

TL;DR

This work tackles selective QA under denotational and epistemic uncertainty, focusing on ambiguous questions where meaning and facts may both be unclear. It proposes a disambiguate-then-answer framework and demonstrates that sampling-based confidence signals (Sampling Repetition, Sampling Diversity) yield better calibration than traditional Likelihood or Self-Verification approaches, especially for ambiguous queries. The authors provide extensive evaluation on unambiguous datasets (Natural Questions, TriviaQA) and ambiguous datasets (AmbigQA, SituatedQA), showing improved C@80 and overall calibration when using sampling-based scores, including with instruction-tuned models. The results support deploying sampling-based calibration to enable abstention and safer answer generation in real-world QA systems, while acknowledging trade-offs in compute and limitations of the single-model scope and closed-book setting.

Abstract

Trustworthy language models should abstain from answering questions when they do not know the answer. However, the answer to a question can be unknown for a variety of reasons. Prior research has focused on the case in which the question is clear and the answer is unambiguous but possibly unknown, but the answer to a question can also be unclear due to uncertainty of the questioner's intent or context. We investigate question answering from this perspective, focusing on answering a subset of questions with a high degree of accuracy, from a set of questions in which many are inherently ambiguous. In this setting, we find that the most reliable approach to decide when to abstain involves quantifying repetition within sampled model outputs, rather than the model's likelihood or self-verification as used in prior work. We find this to be the case across different types of uncertainty and model scales, and with or without instruction tuning. Our results suggest that sampling-based confidence scores help calibrate answers to relatively unambiguous questions, with more dramatic improvements on ambiguous questions.

Paper Structure (40 sections, 1 equation, 5 figures, 8 tables)


Figures (5)

  • Figure 1: Uncertainty in Question Answering systems may arise in various ways. We propose a scheme called disambiguate then answer where the model first attempts to pose an unambiguous interpretation of the user question (yellow), then selectively produces an answer to this question, alternatively abstaining ("Unknown") (green). The log likelihood of the model is shown above each generation. In addition, we find that sampling multiple times from the model generally allows for more robust confidence estimates.
  • Figure 2: Methods for estimating the confidence of answers from an LM. Sampling repetition counts the number of times the greedy answer appears among samples from the LM. Sampling diversity counts the number of unique answers among samples from the LM. Self-verification, proposed by Kadavath et al. (2022), prompts the LM again with one of the sampled answers to measure the token-level probability of True. The prompts used for unambiguous and ambiguous questions are shown on the left; for the latter, we additionally prompt the model for disambiguations (omitted in the outputs shown on the right for brevity).
  • Figure 3: Plot of calibration error by comparing bucketed accuracy to bucketed confidence scores across methods. Plotted on the unambiguous portion of Natural Questions.
  • Figure 4: Plot of calibration error by comparing bucketed accuracy to bucketed confidence scores across methods. Plotted on the combined version of AmbigQA, containing ambiguous and unambiguous questions.
  • Figure 5: Precision vs recall for ambiguity prediction. For the sampling-based methods, each point corresponds to a classification threshold on the counts over the ten sampled outputs. The greedy predictions are plotted as single points. None of these systems improve precision over the baseline rate of 53%.
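
The two sampling-based scores described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, the string normalization, the rescaling of the unique-answer count into a score (`1 - unique / total`), and the abstention threshold are all assumptions made for the example; the caption only states that repetition counts matches with the greedy answer and that diversity counts unique answers.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Hypothetical normalization: lowercase and collapse whitespace."""
    return " ".join(answer.lower().split())


def sampling_repetition(greedy: str, samples: Sequence[str]) -> float:
    """Fraction of sampled answers that match the greedy answer."""
    if not samples:
        return 0.0
    target = normalize(greedy)
    matches = sum(normalize(s) == target for s in samples)
    return matches / len(samples)


def sampling_diversity(samples: Sequence[str]) -> float:
    """Score that decreases as the number of unique answers grows.

    Assumed scaling: 1 - (unique answers / total samples).
    """
    if not samples:
        return 0.0
    unique = len({normalize(s) for s in samples})
    return 1.0 - unique / len(samples)


def selective_answer(greedy: str, samples: Sequence[str],
                     threshold: float = 0.5) -> str:
    """Return the greedy answer, or abstain with "Unknown" when the
    repetition score falls below an (assumed) threshold."""
    if sampling_repetition(greedy, samples) >= threshold:
        return greedy
    return "Unknown"


# Ten hypothetical samples for one question; the greedy answer is "Paris".
samples = ["Paris", "paris", "Paris", "Lyon", "Paris",
           "Paris", "Marseille", "Paris", "Paris", "Paris"]
repetition = sampling_repetition("Paris", samples)
diversity = sampling_diversity(samples)
decision = selective_answer("Paris", samples, threshold=0.5)
```

In this made-up example, eight of the ten samples agree with the greedy answer and three distinct answers appear, so both scores are high and the answer is kept. A question whose samples scatter across many answers would score low on both and be answered with "Unknown".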