On Subjective Uncertainty Quantification and Calibration in Natural Language Generation

Ziyu Wang, Chris Holmes

TL;DR

This work derives a measure for epistemic uncertainty, based on a missing data perspective and its characterization as an excess risk, and demonstrates that epistemic uncertainty offers a promising deferral strategy for efficient data acquisition in in-context learning.

Abstract

Applications of large language models often involve the generation of free-form responses, in which case uncertainty quantification becomes challenging. This is due to the need to identify task-specific uncertainties (e.g., about the semantics), which appear difficult to define in general cases. This work addresses these challenges from a perspective of Bayesian decision theory, starting from the assumption that our utility is characterized by a similarity measure that compares a generated response with a hypothetical true response. We discuss how this assumption enables principled quantification of the model's subjective uncertainty and its calibration. We further derive a measure for epistemic uncertainty, based on a missing data perspective and its characterization as an excess risk. The proposed methods can be applied to black-box language models. We illustrate the methods on question answering and machine translation tasks. Our experiments provide a principled evaluation of task-specific calibration, and demonstrate that epistemic uncertainty offers a promising deferral strategy for efficient data acquisition in in-context learning.
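
To make the decision-theoretic setup in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that responses sampled from a black-box model stand in for draws of the hypothetical true response, that the similarity measure takes values in [0, 1], and that the subjective uncertainty can be summarized as one minus the maximum expected similarity. The names `token_f1`, `expected_utility`, and `bayes_action_and_risk` are hypothetical, and token-level F1 is only a placeholder for a task-specific similarity.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def token_f1(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]: token-level F1 (an assumption)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        # Two empty strings are identical; otherwise there is no overlap.
        return float(ta == tb)
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def expected_utility(
    candidate: str,
    samples: Sequence[str],
    sim: Callable[[str, str], float],
) -> float:
    """Monte Carlo estimate of the expected similarity between a candidate
    and the hypothetical true response, using model samples as draws."""
    return sum(sim(candidate, y) for y in samples) / len(samples)


def bayes_action_and_risk(
    candidates: Sequence[str],
    samples: Sequence[str],
    sim: Callable[[str, str], float] = token_f1,
) -> Tuple[str, float]:
    """Return the candidate with the highest expected utility and the
    remaining risk, taken here as 1 - max expected similarity (assumes
    the similarity is bounded in [0, 1])."""
    scores = [expected_utility(c, samples, sim) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], 1.0 - scores[best]


# Usage: three sampled answers to one question; the samples double as candidates.
samples = ["paris", "paris", "london"]
action, risk = bayes_action_and_risk(samples, samples)
```

In this toy usage the samples agree two times out of three, so the selected response is the majority answer and the estimated risk is nonzero. With identical samples the risk would be zero. Only sampled text is required, which is in keeping with the abstract's statement that the methods apply to black-box language models.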

Paper Structure (30 sections, 16 equations, 12 figures, 2 tables)

This paper contains 30 sections, 16 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Question answering: ECE and utility for the llama-3.1-70b model before and after instruction tuning. Error bar denotes 95% bootstrap CI.
  • Figure 2: Question answering: test utility vs ECE for instruction-tuned LMs, averaged over the datasets in lin2024generating. App. \ref{app:exp-qa-results} presents results for individual datasets.
  • Figure 3: Question answering: calibration error from different methods. Error bar denotes 95% bootstrap CI.
  • Figure 4: Machine translation: average utility vs the number of deferrals to the many-shot predictor. We also report the p-value of a permutation test that compares the AUC-DF from the EU method to random.
  • Figure 5: Question answering: prompt template used to filter trivia datasets.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Remark 2.1
  • Remark 2.2: generalizations
  • Claim A.1
  • proof