Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation

Mykyta Ielanskyi, Kajetan Schweighofer, Lukas Aichberger, Sepp Hochreiter

TL;DR

The paper identifies biases and instability in how uncertainty estimation (UE) for natural language generation (NLG) is currently evaluated, particularly when using QA benchmarks with approximate correctness signals. It proposes robust risk indicators, including marginalization over multiple judge variants (SP-MoJI), structured-task exact correctness, out-of-distribution (OOD) and perturbation signals, and an Elo-based aggregation to synthesize diverse results. The study demonstrates that UE method performance is highly task-dependent and that naive evaluation can be gamed, and it advocates principled, task-aware evaluation protocols. These contributions aim to yield more reliable comparisons and accelerate progress in uncertainty-aware NLG systems.

Abstract

Hallucinations are a common issue that undermines the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise from the predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, commonly used approximate correctness functions disagree substantially with one another and, consequently, in how they rank uncertainty estimation (UE) methods. This disagreement makes it possible to inflate the apparent performance of UE methods. We propose several alternative risk indicators for risk correlation experiments that improve the robustness of the empirical assessment of UE algorithms for NLG. For QA tasks, we show that marginalizing over multiple LLM-as-a-judge variants reduces evaluation biases. Furthermore, we explore structured tasks as well as out-of-distribution and perturbation detection tasks, which provide robust and controllable risk indicators. Finally, we propose using an Elo rating of uncertainty estimation methods to give an objective summary over extensive evaluation settings.
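
The risk correlation setup described in the abstract can be illustrated with a small sketch. It scores one UE method by the AUROC between its uncertainty estimates and each judge's incorrectness labels, then averages over judges. This is only one simple reading of "marginalizing over multiple LLM-as-a-judge variants"; the exact SP-MoJI definition is not given in this section, and the function names and toy data below are hypothetical.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def judge_averaged_auroc(
    uncertainty: Sequence[float],
    judge_verdicts: Sequence[Sequence[int]],
) -> float:
    """Average the AUROC over judges (assumption: a plain, unweighted mean).

    Each inner sequence holds one judge's verdicts, 1 = answer judged correct.
    High uncertainty should predict *incorrect* answers, so labels are flipped.
    """
    per_judge = []
    for verdicts in judge_verdicts:
        incorrect = [1 - v for v in verdicts]
        per_judge.append(auroc(uncertainty, incorrect))
    return sum(per_judge) / len(per_judge)


# Hypothetical toy data: four generated answers, two judges that disagree on two of them.
uncertainty = [0.9, 0.1, 0.8, 0.2]
judge_a = [0, 1, 0, 1]
judge_b = [1, 1, 0, 1]
score = judge_averaged_auroc(uncertainty, [judge_a, judge_b])
```

With a single judge, the score depends entirely on that judge's labeling errors; averaging over several judges is meant to dampen the effect of any one judge's bias on the comparison.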

Paper Structure

This paper contains 66 sections, 29 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Approximate correctness consistency on selected QA datasets. R indicates the ROUGE family and B indicates BLEU. Judge models are indicated with J; 'q' stands for the QA prompt used in Farquhar et al. (2024) (see the appendix for more details on prompting). (a) Agreement of correctness metrics in terms of mutual AUROC (not symmetric). Column values are binarized at $0.5$ where applicable. (b) Agreement on the ranking of UE algorithms when labeled by the pair of approximate correctness functions. A $\rho$ of $1$ indicates identical ordering, while a $\rho$ of $0$ indicates uncorrelated rank assignment by the two correctness functions.
  • Figure 2: Bootstrap estimate of the standard deviation of the mean AUC performance on selected QA dataset / model combinations. As a rule of thumb, using SP-MoJI with $4$ judges halves the standard deviation of the performance estimator. For implementation details, refer to the appendix.
  • Figure 3: Correctness consistency on structured datasets. R indicates the ROUGE family and B indicates BLEU. Judge models are indicated with J; 'q' stands for the QA prompt used in Farquhar et al. (2024), while 'g' stands for a more general prompt to evaluate correctness. (a) Agreement of correctness metrics in terms of mutual AUROC (not symmetric). Column values are binarized at $0.5$ where applicable. (b) Correlation of UE algorithm orderings when compared between corresponding pairs of correctness functions.
  • Figure 4: Elo ratings of NLG uncertainty estimation methods. The methods are grouped by color according to their category (see the appendix on the considered uncertainty methods). The line at 1000 Elo indicates the average rating. Elo ratings were independently estimated for several key partitions. Per task: QA - selective prediction on QA datasets, C.TEXT - constrained text generation, CODE - code completion. Per model type: IT - instruction fine-tuned models only, PT - pretrained models only. Finally, we report the partitions of the alternative risk indicators: OOD - out-of-distribution and PERT - perturbation.
  • Figure 5: Agreement of ordering UE methods on TruthfulQA.
  • ...and 4 more figures
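
The Elo aggregation shown in Figure 4 can be illustrated with a minimal sketch. It treats every pair of UE methods within one evaluation setting as a match, won by the method with the higher score, and applies the standard Elo update. The K-factor, match ordering, starting rating, method names, and scores are assumptions made for illustration; this section does not specify the paper's exact procedure.

```python
import itertools
from typing import Dict, List


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ratings(
    settings: List[Dict[str, float]],
    k: float = 16.0,      # assumed K-factor
    base: float = 1000.0,  # assumed starting rating; zero-sum updates keep the mean here
) -> Dict[str, float]:
    """Rate methods from per-setting scores (e.g. AUROC per dataset/model pair)."""
    ratings: Dict[str, float] = {}
    for setting in settings:
        for method in setting:
            ratings.setdefault(method, base)
    for setting in settings:
        # Every pair of methods evaluated in this setting plays one match.
        for a, b in itertools.combinations(sorted(setting), 2):
            if setting[a] > setting[b]:
                outcome_a = 1.0
            elif setting[a] < setting[b]:
                outcome_a = 0.0
            else:
                outcome_a = 0.5
            delta = k * (outcome_a - expected_score(ratings[a], ratings[b]))
            ratings[a] += delta
            ratings[b] -= delta
    return ratings


# Hypothetical scores of three UE methods in two evaluation settings.
example_settings = [
    {"method_a": 0.80, "method_b": 0.70, "method_c": 0.60},
    {"method_a": 0.75, "method_b": 0.65, "method_c": 0.70},
]
example_ratings = elo_ratings(example_settings)
```

Because each update adds to one method exactly what it subtracts from the other, the average rating stays at the starting value, which is consistent with the 1000-Elo line in Figure 4 marking the average. Sequential Elo depends on match order, so a real evaluation would likely shuffle or repeat the matches.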