Variability Need Not Imply Error: The Case of Adequate but Semantically Distinct Responses

Evgenia Ilia, Wilker Aziz

TL;DR

The paper critiques semantic-entropy-based uncertainty measures for open-ended prompts, where multiple adequate interpretations exist, and introduces ProbAR, the probability the model assigns to adequate responses, as a general, instance-level confidence measure. ProbAR is estimated by Monte Carlo: responses are sampled from the model, a binary judge labels each one as adequate or inadequate, and the labels are aggregated across samples. The resulting score supports selective prediction that aligns better with downstream correctness across prompts of varying open-endedness. Experiments with OPT models on Abg-COQA, AmbigQA, and Provo Corpus show that ProbAR outperforms semantic entropy and related baselines in AUROC and selective precision, both in manual upper-bound analyses and in automated evaluations. The work provides a practical uncertainty quantifier for open-ended NLP tasks and lays groundwork for confidence-aware decoding and broader application to tasks with high data uncertainty.
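
The Monte Carlo estimate described in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `estimate_probar`, the `toy_judge` stand-in, and the hand-written set of adequate answers are all hypothetical, and the sketch assumes the responses were sampled from the model's own distribution without reweighting.

```python
from typing import Callable, Sequence


def estimate_probar(
    prompt: str,
    sampled_responses: Sequence[str],
    is_adequate: Callable[[str, str], bool],
) -> float:
    """Monte Carlo estimate: fraction of sampled responses judged adequate."""
    if not sampled_responses:
        raise ValueError("need at least one sampled response")
    adequate = sum(1 for r in sampled_responses if is_adequate(prompt, r))
    return adequate / len(sampled_responses)


# Hypothetical stand-in for the binary judge: it accepts any response that
# appears in a hand-written set of adequate answers for this one prompt.
ADEQUATE_FOR_DATE = {"a fruit", "a day of the year", "a romantic meeting"}


def toy_judge(prompt: str, response: str) -> bool:
    return response.lower() in ADEQUATE_FOR_DATE


# Hypothetical samples; in practice these would be drawn from the LM.
samples = ["a fruit", "a day of the year", "a romantic meeting", "blue"]
probar = estimate_probar("What is a date?", samples, toy_judge)
print(f"ProbAR estimate: {probar:.2f}")  # prints "ProbAR estimate: 0.75"
```

In a real setting the judge would be a classifier or human annotation, and the quality of the estimate depends on both the number of samples and the judge's accuracy.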

Abstract

With the broader use of language models (LMs) comes the need to estimate their ability to respond reliably to prompts (e.g., are generated responses likely to be correct?). Uncertainty quantification tools (notions of confidence and entropy, among others) can be used to that end (e.g., to reject a response when the model is 'uncertain'). For example, Kuhn et al. (semantic entropy; 2022b) regard semantic variation amongst sampled responses as evidence that the model 'struggles' with the prompt and that the LM is likely to err. We argue that semantic variability need not imply error; this is especially intuitive in open-ended settings, where prompts elicit multiple adequate but semantically distinct responses. Hence, we propose to annotate sampled responses for their adequacy to the prompt (e.g., using a classifier) and estimate the Probability the model assigns to Adequate Responses (ProbAR), which we then regard as an indicator of the model's reliability at the instance level. We evaluate ProbAR as a measure of confidence in selective prediction with OPT models (in two QA datasets and in next-word prediction, for English) and find ProbAR to outperform semantic entropy across prompts with varying degrees of ambiguity/open-endedness.
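
To make the argument concrete, the toy sketch below contrasts an entropy computed over semantic clusters with the fraction of adequate responses. Everything here is an illustrative assumption: the cluster ids and adequacy labels are invented, the helper names are hypothetical, and the entropy uses empirical cluster frequencies, which simplifies semantic entropy as defined by Kuhn et al.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def empirical_cluster_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    n = len(cluster_ids)
    probs = [count / n for count in Counter(cluster_ids).values()]
    return sum(p * math.log(1.0 / p) for p in probs)


def adequate_fraction(adequacy_labels: Sequence[bool]) -> float:
    """Fraction of sampled responses labelled adequate."""
    return sum(adequacy_labels) / len(adequacy_labels)


# Hypothetical model X: four samples with four distinct meanings, all adequate
# (e.g., different plausible readings of an ambiguous question).
x_clusters = ["fruit", "calendar day", "romantic meeting", "palm tree"]
x_adequate = [True, True, True, True]

# Hypothetical model Y: four samples sharing one meaning, all inadequate.
y_clusters = ["nonsense", "nonsense", "nonsense", "nonsense"]
y_adequate = [False, False, False, False]

for name, clusters, labels in [
    ("X", x_clusters, x_adequate),
    ("Y", y_clusters, y_adequate),
]:
    entropy = empirical_cluster_entropy(clusters)
    fraction = adequate_fraction(labels)
    print(f"model {name}: cluster entropy={entropy:.3f}, adequate fraction={fraction:.2f}")
```

Under these invented labels, model X has the highest possible cluster entropy for four samples yet every response is adequate, while model Y has zero entropy and no adequate response. An entropy-based rule would rank the two models in the wrong order of reliability; the adequacy-based score would not.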

Paper Structure

This paper contains 41 sections, 3 equations, 21 figures, 5 tables.

Figures (21)

  • Figure 1: Bottom: a sample-based approximation of an LM's distribution over responses given the question 'What is a date?'; while the distribution exhibits high entropy, some answers are semantically equivalent. Middle: responses are clustered by meaning; while this representation still exhibits high 'semantic entropy' (Kuhn et al., 2022b), probability concentrates on answers to different but plausible interpretations of the question. Top: responses are grouped as a function of their adequacy to the prompt (i.e., with respect to any of its plausible interpretations); we regard the probability accumulated by adequate responses (ProbAR) as an expression of confidence and expect it to predict a model's instance-level performance on both more and less ambiguous/open-ended prompts.
  • Figure 2: For each question, we show adequate responses (with respect to a plausible interpretation of the prompt) in green and inadequate ones in red (also italicised); bounding boxes highlight semantic clusters (relevant for SE) and adequacy/inadequacy (relevant for ProbAR). For the first prompt, SE makes reasonable predictions (high/low uncertainty for model A/B, resp.) because the question strongly constrains plausible answers for their semantic content. In the second prompt, the ambiguity inherent to 'date' allows for plausible answers that convey different meanings. SE makes some poor predictions (e.g., that models C and E are maximally uncertain, obscuring the fact that all answers from C are adequate; that model F is fairly certain, obscuring the fact that the dominant cluster is made of nonsensical responses), while ProbAR makes a reasonable prediction in each case.
  • Figure 3: AUROC values for various quantifiers for Abg-COQA.
  • Figure 4: AUROCs for Provo Corpus (left) and AmbigQA (right) tasks. For Provo Corpus (200 prompts), correctness was assessed via exact matching of a sampled response to human references (AUROC is averaged over 5 runs). For AmbigQA (1070 prompts), correctness of the greedy response was assessed by gpt3.5-turbo.
  • Figure 5: Example of an ambiguous instance from Abg-COQA.
  • ...and 16 more figures
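
Several of the figures above report AUROC. As a rough sketch of what that number measures, the snippet below computes AUROC for a confidence score against binary correctness labels using the pairwise formulation: the probability that a correct instance receives a higher confidence than an incorrect one, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Pairwise AUROC: P(conf of a correct instance > conf of an incorrect one)."""
    pos = [s for s, c in zip(confidences, correct) if c]
    neg = [s for s, c in zip(confidences, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect instance")
    # Each (correct, incorrect) pair scores 1 if ranked properly, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical per-prompt confidence scores and correctness labels.
good_scores = [0.9, 0.8, 0.3, 0.6, 0.1]
good_labels = [True, True, False, True, False]

# Hypothetical scores that rank correct and incorrect prompts no better than chance.
weak_scores = [0.9, 0.2, 0.7, 0.4]
weak_labels = [True, True, False, False]

print(auroc(good_scores, good_labels))  # 1.0
print(auroc(weak_scores, weak_labels))  # 0.5
```

A value of 1.0 means the score separates correct from incorrect instances perfectly, and 0.5 means it ranks them no better than chance. The quadratic pairwise loop is chosen for readability; a rank-based computation is preferable for large evaluation sets.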