Improving Uncertainty Estimation through Semantically Diverse Language Generation

Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter

TL;DR

The paper tackles hallucinations in autoregressive LLMs by reframing them as semantic uncertainty in NLG and proposing Semantically Diverse Language Generation (SDLG) to quantify it. It builds a theoretical foundation for semantic entropy, derives estimators based on importance sampling, and introduces a token-level substitution mechanism guided by attribution, substitution, and importance scores to generate semantically diverse yet likely outputs. Empirically, SDLG outperforms existing uncertainty estimators on free-form QA benchmarks (TruthfulQA, CoQA, TriviaQA) across OPT models, while reducing per-sample computational cost; it also increases semantic cluster coverage compared to baseline sampling. The work advances reliable uncertainty estimation in NLG and offers a practical, hyperparameter-light approach to stress-test LLMs for improved trustworthiness, while acknowledging limitations such as the single-cluster assumption and suggesting avenues for exploring epistemic semantic uncertainty in future work.

Abstract

Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that predictive uncertainty is one of the main causes of hallucinations. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs.
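To make the notion of semantic uncertainty more concrete, the following is a minimal sketch of an entropy computed over semantic clusters rather than over individual output sequences. It is not the estimator derived in the paper: the function name `semantic_entropy`, the input format of `(cluster_id, log_prob)` pairs, and the choice to normalise over only the observed clusters are all assumptions made for illustration. It also assumes the semantic clustering of the generated sequences has already been done by some other component.

```python
import math
from collections import defaultdict


def semantic_entropy(samples):
    """Entropy over semantic clusters (illustrative sketch, not the paper's estimator).

    samples: list of (cluster_id, log_prob) pairs, one per distinct generated
    sequence, where log_prob is the sequence log-likelihood under the model
    and cluster_id is a hypothetical label for its semantic cluster.
    """
    # Sum the sequence probabilities that fall into each semantic cluster.
    mass = defaultdict(float)
    for cluster_id, log_prob in samples:
        mass[cluster_id] += math.exp(log_prob)

    # Normalise over the clusters that were actually observed (an assumption).
    total = sum(mass.values())
    probs = [m / total for m in mass.values()]

    # Shannon entropy of the resulting cluster distribution, in nats.
    return -sum(p * math.log(p) for p in probs if p > 0)


# Two differently worded answers with the same meaning: a single cluster.
same_meaning = [("paris", math.log(0.5)), ("paris", math.log(0.25))]
# Two equally likely answers with different meanings: two clusters.
different_meaning = [("paris", math.log(0.3)), ("lyon", math.log(0.3))]

print(semantic_entropy(same_meaning))
print(semantic_entropy(different_meaning))
```

In this sketch, alternatives that differ only in wording fall into one cluster and contribute no entropy, whereas likely alternatives with a different meaning raise it. This is why finding semantically diverse yet likely sequences matters for the estimate.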

Paper Structure (42 sections, 21 equations, 11 figures, 3 tables, 2 algorithms)

Figures (11)

  • Figure 1: Using standard multinomial sampling to generate text does not account for its semantics. Thus, it relies on chance to obtain semantically diverse output sequences and is prone to miss them. SDLG addresses this by specifically searching for likely, but semantically different output sequences. Thereby, the estimation of semantic uncertainty in language models is improved.
  • Figure 2: SDLG
  • Figure 3: (a) AUROC when using uncertainty measures, across various numbers of samples, as the score to distinguish between correct and incorrect answers on the CoQA dataset. Solid and dotted lines indicate the performance when using the proper and improper semantic entropy estimator, respectively. (b) Average number of semantic clusters found across various numbers of samples considered.
  • Figure 4: Average number of Teraflops required for an increasing number of samples generated with SDLG vs. standard multinomial sampling (MS) and Shifting Attention to Relevance (SAR).
  • Figure 5: Synthetic example of approximating a cluster distribution $p(c)$ of an underlying probability distribution $p(y)$. In (a) the distributions are shown. In (b), the bias and variance over 200 runs of the MC approximations per number of samples using Eq. \ref{eq:semantic_cluster_distribution_mc} (blue) and Eq. \ref{eq:semantic_cluster_distribution_is_kuhn} (orange) are given.
  • ...and 6 more figures