On Subjective Uncertainty Quantification and Calibration in Natural Language Generation

Ziyu Wang; Chris Holmes

TL;DR

This work derives a measure for epistemic uncertainty, based on a missing data perspective and its characterization as an excess risk, and demonstrates that epistemic uncertainty offers a promising deferral strategy for efficient data acquisition in in-context learning.

Abstract

Applications of large language models often involve the generation of free-form responses, in which case uncertainty quantification becomes challenging. This is due to the need to identify task-specific uncertainties (e.g., about the semantics), which appear difficult to define in general cases. This work addresses these challenges from the perspective of Bayesian decision theory, starting from the assumption that our utility is characterized by a similarity measure that compares a generated response with a hypothetical true response. We discuss how this assumption enables principled quantification of the model's subjective uncertainty and its calibration. We further derive a measure for epistemic uncertainty, based on a missing data perspective and its characterization as an excess risk. The proposed methods can be applied to black-box language models. We illustrate the methods on question answering and machine translation tasks. Our experiments provide a principled evaluation of task-specific calibration, and demonstrate that epistemic uncertainty offers a promising deferral strategy for efficient data acquisition in in-context learning.
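
To make the decision-theoretic setup in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the "hypothetical true response" is represented by responses sampled from a black-box model, and that utility is a similarity score in [0, 1]. The names `token_f1`, `expected_utility`, and `select_response` are hypothetical, and the token-overlap similarity is a toy stand-in for whatever task-specific measure one would actually use. The paper's own estimator may differ in its details.

```python
from typing import Callable, Sequence, Tuple


def token_f1(a: str, b: str) -> float:
    """Toy similarity (hypothetical stand-in): F1 overlap of whitespace token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        # Two empty responses count as identical; one empty response matches nothing.
        return float(ta == tb)
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    precision = overlap / len(ta)
    recall = overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def expected_utility(
    candidate: str,
    samples: Sequence[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Monte Carlo estimate of the candidate's expected similarity to a 'true'
    response, treating the model's own samples as draws of that response."""
    return sum(similarity(candidate, s) for s in samples) / len(samples)


def select_response(
    samples: Sequence[str],
    similarity: Callable[[str, str], float] = token_f1,
) -> Tuple[str, float]:
    """Pick the sampled response with the highest estimated expected utility.

    The returned utility is the model's subjective estimate: a low value means
    the samples disagree with one another under the chosen similarity.
    Each candidate is also compared with itself, which is an assumption of this
    sketch and slightly inflates the estimate for small sample sizes.
    """
    scored = [(expected_utility(c, samples, similarity), c) for c in samples]
    best_utility, best = max(scored, key=lambda pair: pair[0])
    return best, best_utility


if __name__ == "__main__":
    answers = ["paris", "paris", "london"]  # pretend these came from a black-box LM
    response, utility = select_response(answers)
    print(response, round(utility, 3))
```

Under these assumptions, comparing the estimated utility with the utility actually achieved against reference answers is one way to think about the task-specific calibration the abstract refers to.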

Paper Structure (30 sections, 16 equations, 12 figures, 2 tables)

Introduction
Quantifying Uncertainty and Calibration
A Utilitarian Setup
Task-Specific Measures of Subjective Uncertainty
Evaluation of Task-Specific Calibration
Representing and Eliciting Epistemic Uncertainty
Epistemic uncertainty in ICL.
Bayesian justifications.
Connection to non-Bayesian methods.
Related Work
Probabilistic UQ for LMs.
Calibration in free-form generation.
Prompt-based methods, instruction-tuned LMs.
Epistemic uncertainty.
Experiments
...and 15 more sections

Figures (12)

Figure 1: Question answering: ECE and utility for the llama-3.1-70b model before and after instruction tuning. Error bars denote 95% bootstrap CIs.
Figure 2: Question answering: test utility vs ECE for instruction-tuned LMs, averaged over the datasets in \cite{lin2024generating}. App. \ref{app:exp-qa-results} presents results for individual datasets.
Figure 3: Question answering: calibration error from different methods. Error bars denote 95% bootstrap CIs.
Figure 4: Machine translation: average utility vs the number of deferrals to the many-shot predictor. We also report the p-value of a permutation test that compares the AUC-DF from the EU method to random.
Figure 5: Question answering: prompt template used to filter trivia datasets.
...and 7 more figures

Theorems & Definitions (4)

Remark 2.1
Remark 2.2: generalizations
Claim A.1
proof
