Variability Need Not Imply Error: The Case of Adequate but Semantically Distinct Responses
Evgenia Ilia, Wilker Aziz
TL;DR
The paper critiques semantic-entropy-based uncertainty measures for open-ended prompts, where multiple adequate but semantically distinct responses exist, and introduces PROBAR, the probability the model assigns to adequate responses, as a general, instance-level confidence measure. PROBAR is estimated by Monte Carlo: responses are sampled from the model, each is labelled adequate or inadequate by a binary judge, and the labels are averaged. The resulting score supports selective prediction that aligns better with downstream correctness across prompts of varying open-endedness. In experiments with OPT models on Abg-CoQA, AmbigQA, and the Provo Corpus, PROBAR outperforms semantic entropy and related baselines in AUROC and selective precision, in both manual upper-bound analyses and automated evaluations. The work provides a practical uncertainty quantifier for open-ended NLP tasks and lays groundwork for confidence-aware decoding and for application to other tasks with high data uncertainty.
Abstract
With the broader use of language models (LMs) comes the need to estimate their ability to respond reliably to prompts (e.g., are generated responses likely to be correct?). Uncertainty quantification tools (notions of confidence and entropy, i.a.) can be used to that end (e.g., to reject a response when the model is 'uncertain'). For example, Kuhn et al. (semantic entropy; 2022b) regard semantic variation amongst sampled responses as evidence that the model 'struggles' with the prompt and that the LM is likely to err. We argue that semantic variability need not imply error; this is especially intuitive in open-ended settings, where prompts elicit multiple adequate but semantically distinct responses. Hence, we propose to annotate sampled responses for their adequacy to the prompt (e.g., using a classifier) and estimate the Probability the model assigns to Adequate Responses (PROBAR), which we then regard as an indicator of the model's reliability at the instance level. We evaluate PROBAR as a measure of confidence in selective prediction with OPT models (in two QA datasets and in next-word prediction, for English) and find PROBAR to outperform semantic entropy across prompts with varying degrees of ambiguity/open-endedness.
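As an illustration of the idea, the following minimal Python sketch estimates PROBAR for one prompt as the fraction of sampled responses that a binary judge labels adequate, then uses the estimate for selective prediction. It is a sketch under stated assumptions, not the authors' implementation: the names `estimate_probar`, `should_answer`, and `toy_judge`, the toy prompt and samples, and the 0.5 threshold are all hypothetical, and the set-membership judge merely stands in for a human annotator or a trained adequacy classifier.

```python
from typing import Callable, Sequence


def estimate_probar(
    prompt: str,
    responses: Sequence[str],
    is_adequate: Callable[[str, str], bool],
) -> float:
    """Monte Carlo estimate: share of sampled responses judged adequate."""
    if not responses:
        raise ValueError("need at least one sampled response")
    adequate = sum(1 for r in responses if is_adequate(prompt, r))
    return adequate / len(responses)


def should_answer(probar: float, threshold: float = 0.5) -> bool:
    """Selective prediction: answer only if confidence reaches the threshold.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    return probar >= threshold


def toy_judge(prompt: str, response: str) -> bool:
    """Hypothetical stand-in for an adequacy judge (annotator or classifier)."""
    return response.lower() in {"apple", "banana", "mango", "pear"}


# Hypothetical responses sampled from an LM for an open-ended prompt.
# Three are semantically distinct yet adequate; one is inadequate.
samples = ["apple", "banana", "carrot", "mango"]
score = estimate_probar("Name a fruit.", samples, toy_judge)
print(score, should_answer(score))
```

In this toy case the three adequate responses all differ in meaning, so a measure based on semantic variation alone would treat the prompt as one the model struggles with, while the adequacy-based estimate stays high.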
