Rethinking Uncertainty Estimation in Natural Language Generation

Lukas Aichberger, Kajetan Schweighofer, Sepp Hochreiter

TL;DR

This paper tackles the challenge of estimating predictive uncertainty in natural language generation without the heavy computational burden of sampling many output sequences. It grounds uncertainty measures in proper scoring rules and shows that the negative log-likelihood of the most likely output sequence is a theoretically principled uncertainty measure. G-NLL approximates this measure with the single sequence produced by greedy decoding. The key step is replacing the conventional logarithmic score with the zero-one score, which makes the uncertainty estimate obtainable from one decoded sequence without giving up theoretical rigor. Across QA-style tasks and diverse models, G-NLL matches or surpasses state-of-the-art methods, especially for concise outputs, while requiring one generated sequence instead of many.

Abstract

Large Language Models (LLMs) are increasingly employed in real-world applications, driving the need to evaluate the trustworthiness of their generated text. To this end, reliable uncertainty estimation is essential. Since current LLMs generate text autoregressively through a stochastic process, the same prompt can lead to varying outputs. Consequently, leading uncertainty estimation methods generate and analyze multiple output sequences to determine the LLM's uncertainty. However, generating output sequences is computationally expensive, making these methods impractical at scale. In this work, we inspect the theoretical foundations of the leading methods and explore new directions to enhance their computational efficiency. Building on the framework of proper scoring rules, we find that the negative log-likelihood of the most likely output sequence constitutes a theoretically grounded uncertainty measure. To approximate this alternative measure, we propose G-NLL, which has the advantage of being obtained using only a single output sequence generated by greedy decoding. This makes uncertainty estimation more efficient and straightforward, while preserving theoretical rigor. Empirical results demonstrate that G-NLL achieves state-of-the-art performance across various LLMs and tasks. Our work lays the foundation for efficient and reliable uncertainty estimation in natural language generation, challenging the necessity of more computationally involved methods currently leading the field.
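The idea behind G-NLL can be illustrated with a minimal sketch: decode greedily, and sum the negative log-probabilities of the chosen tokens. The function names `g_nll` and `toy_model`, the three-token vocabulary, and the probability values are hypothetical stand-ins for a real LLM's next-token distribution $p(y_t \mid \bm{y}_{<t}, \bm{x})$; this is not the authors' implementation.

```python
import math

def g_nll(next_token_probs, max_len, eos=None):
    """Greedy-decode one sequence and return (tokens, negative log-likelihood).

    next_token_probs: callable mapping a tuple of previous tokens to a list
    of next-token probabilities (assumed interface, for illustration only).
    """
    tokens = []
    nll = 0.0
    for _ in range(max_len):
        probs = next_token_probs(tuple(tokens))
        # Greedy decoding: pick the single most likely next token.
        token = max(range(len(probs)), key=probs.__getitem__)
        # The sequence log-likelihood is the sum of token log-probabilities.
        nll -= math.log(probs[token])
        tokens.append(token)
        if token == eos:
            break
    return tokens, nll

def toy_model(prefix):
    """Hypothetical next-token distribution; token 2 plays the role of EOS."""
    if len(prefix) == 0:
        return [0.7, 0.2, 0.1]
    if prefix[-1] == 0:
        return [0.1, 0.1, 0.8]
    return [0.3, 0.3, 0.4]

sequence, uncertainty = g_nll(toy_model, max_len=5, eos=2)
print(sequence, uncertainty)
```

A higher value means the model assigns less probability to its own greedy answer, which is read here as higher uncertainty. With a real model, the same quantity would typically be accumulated from the per-step log-probabilities returned during generation.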

Paper Structure

This paper contains 24 sections, 11 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Quality of estimators for synthetic predictive distributions $p(\bm{y} \mid \bm{x})$ with $|\mathcal{V}|=20$ and $T=4$. The predictive entropy $\mathrm{H}( p(\bm{y} \mid \bm{x}) )$ is estimated as in Eq. \ref{eq:predictive_entropy} using multinomial sampling (MS) with different temperatures ($\tau$). The maximum sequence likelihood $\max_{\bm{y}} p(\bm{y} \mid \bm{x})$ is estimated by the maximum over $N$ samples obtained by beam search ($N=1$ represents greedy decoding) or MS with different $\tau$. Statistics are obtained by sampling different $p(\bm{y} \mid \bm{x})$. (a) Lines show the average, shades denote one standard deviation. (b) Lines show the median, shades denote the 5% to 95% quantile range.
  • Figure 2: Average AUROC over TriviaQA instances, using the Llama-3.1-8B model to generate short phrase answers. The ten output sequences for the baselines are generated with their best hyperparameter setting. The one output sequence for NLL is generated with a specific decoding method.
  • Figure 3: Exemplary predictive distributions $p(y_t \mid \bm{y}_{<t}, \bm{x})$ for different vocabulary sizes (widths).
  • Figure 4: Estimator of predictive entropy. Results for different vocabulary sizes (width) and sequence lengths (depth). We estimate the entropy $\mathrm{H}(p(\bm{y} \mid \bm{x}))$ using $N$ Monte-Carlo samples (cf. Eq. \ref{eq:predictive_entropy}). Lines denote the average over runs, while shades denote one standard deviation. We compare multinomial sampling (MS) for two commonly used temperatures. The experiments show that a decreased temperature leads to lower variance but introduces bias.
  • Figure 5: Estimator of maximum sequence likelihood. Results for different vocabulary sizes (width) and sequence lengths (depth). We estimate $\max_{\bm{y}} p(\bm{y} \mid \bm{x})$ using the maximum over $N$ samples obtained by beam search ($N=1$ is greedy decoding) or multinomial sampling (MS) with different temperatures. Lines denote the median, shades span the 5% to 95% quantile range. Beam search is deterministic for any given distribution $p(\bm{y} \mid \bm{x})$. Even with a very low number of samples, low-temperature MS and beam search are able to find the maximum with high probability.
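
The two estimators compared in the captions above can be sketched on a toy distribution. This is a simplified illustration under stated assumptions, not the paper's experimental setup: the distribution is a flat categorical over 50 hypothetical "sequences" (so the autoregressive structure, greedy decoding, and beam search are not modeled), and the helper names `tempered`, `mc_entropy`, and `max_likelihood_estimate` are invented for this example. Samples are drawn from the temperature-scaled distribution but scored under the original one, which is where the bias at $\tau < 1$ comes from.

```python
import math
import random

def tempered(p, tau):
    """Return q proportional to p**(1/tau); tau=1 leaves p unchanged."""
    weights = [pi ** (1.0 / tau) for pi in p]
    total = sum(weights)
    return [w / total for w in weights]

def sample_indices(p, n, tau, rng):
    """Draw n outcomes by multinomial sampling at temperature tau."""
    return rng.choices(range(len(p)), weights=tempered(p, tau), k=n)

def mc_entropy(p, n, tau, rng):
    """Monte-Carlo entropy estimate: average -log p over the drawn samples."""
    idx = sample_indices(p, n, tau, rng)
    return -sum(math.log(p[i]) for i in idx) / n

def max_likelihood_estimate(p, n, tau, rng):
    """Estimate max_y p(y) by the largest probability among n samples."""
    idx = sample_indices(p, n, tau, rng)
    return max(p[i] for i in idx)

# Assumed synthetic distribution: normalized exponential draws.
rng = random.Random(0)
raw = [rng.expovariate(1.0) for _ in range(50)]
p = [r / sum(raw) for r in raw]
true_entropy = -sum(pi * math.log(pi) for pi in p)

for tau in (1.0, 0.5):
    h = mc_entropy(p, 10, tau, rng)
    m = max_likelihood_estimate(p, 10, tau, rng)
    print(f"tau={tau}: entropy estimate {h:.3f} (true {true_entropy:.3f}), "
          f"max estimate {m:.3f} (true {max(p):.3f})")
```

At $\tau = 1$ the entropy estimate is unbiased but noisy for small $N$. Lowering $\tau$ concentrates samples on high-probability outcomes, which pulls the entropy estimate below the true value while making it more likely that the maximum is found.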