Uncertainty in Language Models: Assessment through Rank-Calibration

Xinmeng Huang; Shuo Li; Mengxin Yu; Matteo Sesia; Hamed Hassani; Insup Lee; Osbert Bastani; Edgar Dobriban

TL;DR

The paper tackles the challenge of comparing uncertainty measures for language models, which vary in scale and can be entangled with model quality. It introduces Rank-Calibration and the Rank-Calibration Error (RCE) as a threshold-free, range-invariant framework that ties lower uncertainty to higher expected correctness via a monotone regression function. Empirical methods, including Empirical RCE and indication diagrams, enable practical evaluation and interpretation across diverse measures (NLL, semantic entropy, affinity-graph metrics, etc.) and datasets. The work demonstrates broad applicability, provides qualitative and quantitative insights, and suggests post-hoc recalibration as a practical enhancement, highlighting a path toward more reliable uncertainty assessment in NLG systems.

Abstract

Language Models (LMs) have shown promising performance in natural language generation. However, as LMs often generate incorrect or hallucinated responses, it is crucial to correctly quantify their uncertainty in responding to given inputs. In addition to verbalized confidence elicited via prompting, many uncertainty measures (e.g., semantic entropy and affinity-graph-based measures) have been proposed. However, these measures can differ greatly, and it is unclear how to compare them, partly because they take values over different ranges (e.g., $[0,\infty)$ or $[0,1]$). In this work, we address this issue by developing a novel and practical framework, termed Rank-Calibration, to assess uncertainty and confidence measures for LMs. Our key tenet is that higher uncertainty (or lower confidence) should imply lower generation quality, on average. Rank-calibration quantifies deviations from this ideal relationship in a principled manner, without requiring ad hoc binary thresholding of the correctness score (e.g., ROUGE or METEOR). The broad applicability and the granular interpretability of our methods are demonstrated empirically.
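To make the key tenet concrete, the following is a minimal sketch of how a binned, rank-based miscalibration score could be computed from paired uncertainty and correctness values. It is an illustration under stated assumptions, not the paper's reference implementation: the function name `empirical_rce`, the parameter `num_bins`, and the equal-mass binning scheme are all hypothetical choices made here. The idea is that, within each uncertainty bin, the fraction of other bins with lower-or-equal uncertainty should match the fraction of other bins with higher-or-equal average correctness.

```python
import numpy as np


def empirical_rce(uncertainty, correctness, num_bins=20):
    """Binned rank-miscalibration score in [0, 1] (hypothetical helper).

    Assumption: samples are grouped into roughly equal-mass bins by
    sorted uncertainty; only ranks of the bin averages are compared,
    so the score does not depend on the scale of either input.
    """
    u = np.asarray(uncertainty, dtype=float)
    a = np.asarray(correctness, dtype=float)
    if u.ndim != 1 or u.shape != a.shape:
        raise ValueError("inputs must be 1-D arrays of equal length")

    # Sort samples by uncertainty and split them into equal-mass bins.
    order = np.argsort(u, kind="stable")
    bins = [b for b in np.array_split(order, num_bins) if b.size > 0]
    if len(bins) < 2:
        raise ValueError("need at least two non-empty bins")

    mean_u = np.array([u[b].mean() for b in bins])  # average uncertainty per bin
    mean_a = np.array([a[b].mean() for b in bins])  # average correctness per bin
    weights = np.array([b.size for b in bins]) / u.size
    others = len(bins) - 1

    # For each bin: share of the OTHER bins with uncertainty <= its own,
    # and share of the OTHER bins with correctness >= its own.
    frac_u_le = ((mean_u[None, :] <= mean_u[:, None]).sum(axis=1) - 1) / others
    frac_a_ge = ((mean_a[None, :] >= mean_a[:, None]).sum(axis=1) - 1) / others

    # Weighted average gap between the two rank statistics.
    return float(np.sum(weights * np.abs(frac_a_ge - frac_u_le)))
```

In this sketch, a measure whose average correctness falls strictly as uncertainty rises scores 0, and applying any strictly increasing rescaling to the uncertainty values (for example, mapping $[0,1]$ to $[0,\infty)$) leaves the score unchanged, because only the ordering of the bins enters the computation. No threshold on the correctness score is needed.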

Paper Structure (52 sections, 2 theorems, 15 equations, 16 figures, 7 tables)

Introduction
Correctness and Uncertainty for LMs
Limitations of Existing Assessments
Ad hoc correctness thresholding.
Diverse output ranges.
Strong dependence on LM performance.
Desiderata of evaluation.
Rank-Calibration
Rank-Calibration & RCE
Extension to confidence measures.
Comparison with Classical Calibration
Empirical RCE & Indication Diagram
Empirical RCE.
Indication diagram.
Advantages of rank-calibration.
...and 37 more sections

Key Result

Theorem 1

Suppose the correctness function $A$ takes values in $\{0,1\}$. If an uncertainty measure $U$ is rank-calibrated, i.e., its RCE is zero, then there exists a unique strictly decreasing transformation $g^\star: \mathbb{R} \to [0,1]$ such that $C_{g^\star} := g^\star(U)$ is calibr
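The statement above is cut off in this summary. As a hedged illustration only, and not the paper's proof, the short derivation below shows why a decreasing transformation of this kind is plausible. It rests on assumptions that do not appear in the text above: the notation $\operatorname{reg}(u) := \mathbb{E}[A \mid U = u]$ for the regression function, the premise that $\operatorname{reg}$ is strictly decreasing, and the reading of "calibrated" in the classical sense $\mathbb{E}[A \mid C] = C$.

```latex
% Assumptions (not taken from the summary text):
%   reg(u) := E[A | U = u] is strictly decreasing in u,
%   and "calibrated" means E[A | C] = C almost surely.
\begin{align*}
g(u) &:= \operatorname{reg}(u) = \mathbb{P}(A = 1 \mid U = u) \in [0,1], \\
\mathbb{E}[A \mid g(U)]
  &= \mathbb{E}[A \mid U]
  && \text{($g$ is injective, so $g(U)$ and $U$ carry the same information)} \\
  &= \operatorname{reg}(U) = g(U).
\end{align*}
```

Under these assumptions, the candidate confidence $C := g(U)$ satisfies $\mathbb{E}[A \mid C] = C$, which matches the classical notion of calibration for binary correctness.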

Figures (16)

Figure 1: Indication diagrams comparing two uncertainty measures, $U_{\rm NLL}$ (negative log-likelihood) and $U_{\rm Ecc}$ (eccentricity), for the GPT-3.5-turbo model on the TriviaQA benchmark. The red bars indicate the average correctness of different outputs, as a function of the corresponding relative uncertainty levels. The blue and light red areas (deviating from the anti-diagonal line) indicate where the uncertainty measures are over-optimistic and pessimistic, respectively. Their sum is our rank-miscalibration metric (i.e., the RCE), which here is lower for $U_{\rm NLL}$ than $U_{\rm Ecc}$. See the section "Empirical RCE & Indication Diagram" for details.
Figure 3: Top: AUROCs of uncertainty/confidence measures with various thresholds. Bottom: Output ranges of uncertainty/confidence measures. Both results are for GPT-3.5-turbo on the TriviaQA benchmark.
Figure 4: Top: Rouge-L correctness distributions of GPT-3.5-turbo on the TriviaQA (left) and Meadow (right) benchmarks. Bottom: AUROCs of assessed measures for GPT-3.5-turbo on Meadow, with Rouge-L correctness and various thresholds.
Figure 5: The assessed results for AUARC (left) and AUPRC (right) of uncertainty/confidence measures for GPT-3.5-turbo on the TriviaQA benchmark using the METEOR correctness score with varying thresholds.
Figure 6: Results for Meadow using GPT-3.5-turbo and the Rouge score.
...and 11 more figures

Theorems & Definitions (4)

Definition 1: Rank-Calibration
Theorem 1
proof
Proposition 1
