Estimating LLM Uncertainty with Evidence

Huan Ma, Jingdong Chen, Joey Tianyi Zhou, Guangyu Wang, Changqing Zhang

TL;DR

The paper tackles hallucinations in LLMs by criticizing probability-based uncertainty metrics that lose evidence strength during normalization. It introduces LogTokU, a logits-based framework that models token-level uncertainty as evidence via a Dirichlet distribution, decoupling aleatoric and epistemic uncertainty into four quadrants. The authors apply LogTokU to two downstream tasks: (1) dynamic decoding that adapts sampling based on uncertainty to balance diversity and accuracy, and (2) reliability estimation that aggregates token uncertainty into sentence-level reliability, demonstrated on SemEval and TruthfulQA benchmarks. Empirical results show LogTokU outperforms baselines in both decoding and reliability estimation across multiple LLM sizes, highlighting its efficiency and practical potential for robust, uncertainty-aware generation.
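The claim that normalization discards evidence strength can be illustrated with a small sketch. It relies only on the standard fact that softmax is invariant to adding a constant to every logit. The logit values and the names `weak` and `strong` are illustrative assumptions, not numbers from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical next-token logit vectors (illustrative values only).
# `strong` is `weak` shifted up by a constant, i.e. every candidate
# token has accumulated more evidence.
weak = [2.0, 1.0, 0.0]
strong = [z + 10.0 for z in weak]

p_weak = softmax(weak)
p_strong = softmax(strong)

# The probabilities are identical, although the total logit mass differs.
print(p_weak)
print(p_strong)
print(sum(weak), sum(strong))
```

Because the two probability vectors coincide, any metric computed from probabilities alone cannot tell these two situations apart.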

Abstract

Over the past few years, Large Language Models (LLMs) have developed rapidly and are widely applied in various domains. However, LLMs face the issue of hallucinations, generating responses that may be unreliable when the models lack relevant knowledge. To be aware of potential hallucinations, uncertainty estimation methods have been introduced, and most of them have confirmed that reliability lies in critical tokens. However, probability-based methods perform poorly in identifying token reliability, limiting their practical utility. In this paper, we reveal that the probability-based method fails to estimate token reliability due to the loss of evidence strength information which is accumulated in the training stage. Therefore, we present Logits-induced token uncertainty (LogTokU), a framework for estimating decoupled token uncertainty in LLMs, enabling real-time uncertainty estimation without requiring multiple sampling processes. We employ evidence modeling to implement LogTokU and use the estimated uncertainty to guide downstream tasks. The experimental results demonstrate that LogTokU has significant effectiveness and promise.
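A rough sketch of what single-pass, logits-based evidence modeling could look like is given below. It follows a common evidential-learning recipe and is not taken from this summary: treating the ReLU of the top-k logits as evidence, adding a prior of 1 to form Dirichlet parameters, using `k / sum(alpha)` as epistemic uncertainty, and using the entropy of the Dirichlet mean as an aleatoric proxy are all assumptions. The function name `token_uncertainty` and the example logits are hypothetical, and the paper's exact formulas may differ.

```python
import math

def token_uncertainty(logits, k=5):
    """Return (aleatoric, epistemic) uncertainty for one decoding step.

    Assumed recipe, not the paper's confirmed formulas:
      evidence  = ReLU(top-k logits)
      alpha     = evidence + 1            (Dirichlet parameters)
      epistemic = k / sum(alpha)          (high when total evidence is low)
      aleatoric = entropy of the Dirichlet mean alpha / sum(alpha)
    """
    top = sorted(logits, reverse=True)[:k]
    evidence = [max(z, 0.0) for z in top]
    alpha = [e + 1.0 for e in evidence]
    alpha0 = sum(alpha)
    epistemic = len(alpha) / alpha0
    mean = [a / alpha0 for a in alpha]
    aleatoric = -sum(p * math.log(p) for p in mean)
    return aleatoric, epistemic

# Hypothetical logit patterns, loosely matching the scenarios in the figures.
strong_single = [20.0, 1.0, 1.0, 1.0, 1.0]    # much evidence, one clear token
strong_multi = [20.0, 20.0, 20.0, 1.0, 1.0]   # much evidence, several good tokens
weak = [1.0, 0.5, 0.5, 0.5, 0.5]              # little evidence overall

for name, z in [("strong_single", strong_single),
                ("strong_multi", strong_multi),
                ("weak", weak)]:
    au, eu = token_uncertainty(z)
    print(f"{name}: AU={au:.3f} EU={eu:.3f}")
```

Under these assumptions both quantities come from one forward pass over the logits, so no repeated sampling is required. The low-evidence case receives the highest epistemic score, and the case with several equally supported tokens receives a higher aleatoric score than the single-token case.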

Paper Structure

This paper contains 33 sections, 1 theorem, 25 equations, 4 figures, 4 tables.

Key Result

Theorem 1

For any LLM $\mathcal{M}$ trained with the cross-entropy loss $L_{\text{CE}}$ using gradient descent optimization (i.e., $\nabla_\mathcal{M} L_{\text{CE}}$), the total evidence $\sum_{\tau^i \in \mathcal{T}} z_{\tau^i}$ will strictly accumulate (i.e., $\Delta\sum_{\tau^i \in \mathcal{T}} z_{\tau^i}>

Figures (4)

  • Figure 1: Why do probability-based methods fail? Left: A pair of examples on LLaMA-2 shows that probability fails to estimate reliability. Since LLMs know the names of many presidents, the probability after normalization is very low; whereas for the future of the universe, since LLMs only know one hypothesis, the probability is very high. The probability-based reliability measure is counterintuitive, as the answers to common-sense questions, where LLMs have rich knowledge, are rated less reliable than answers to unsolved physics problems. This is because probability cannot reflect whether a low probability is due to LLMs knowing multiple correct answers. These two cases are well characterized in this paper, corresponding to the fourth and second quadrants of Fig. 2, respectively. Right: Normalization leads to the loss of evidence strength information.
  • Figure 2: Why does LogTokU work? Left: Illustration of four different scenarios considered in LogTokU, where the gray bars represent the logits for predicting the next token, the triangular patterns represent the corresponding Dirichlet distribution, and the table below compares uncertainty estimation using probability with that using LogTokU. Right: A case study from a medical QA task, where the markings under each word reflect reliability estimated according to LogTokU, as well as the values of AU (gray) and EU (blue). I: Both AU and EU are high, where LLaMA recommends a metal "Chromium" for diabetes patients. II: The total logits are low, but one token's logit is larger than the others, indicating that the LLM lacks experience and knowledge but knows what the next token should be; here the LLM repeats the medicine "Glucomannan" that was generated in the previous context. III: The LLM is very confident about the next token, where it generates the fixed phrase "has been". IV: The LLM has enough knowledge and knows more than one suitable answer. For example, the LLM generates "[comma]", which can be replaced by many other suitable words. The dilemma in Fig. 1 is addressed by quadrants II and IV.
  • Figure 3: Illustration of the experimental setting in Table \ref{tab:percent}.
  • Figure 4: A close-up observation explains why LogTokU achieves the best performance. All samples are sorted by reliability from high to low. The "performance" is the accumulated score (i.e., the number of accumulated correct responses minus the number of accumulated incorrect responses). A trend of increasing and then decreasing represents a good reliability indicator, where the answer becomes more likely to be wrong as reliability decreases.
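The accumulated score described in the Figure 4 caption can be sketched as follows. The function name `accumulated_score` and the toy reliability and correctness values are hypothetical; only the procedure (sort by reliability from high to low, then add 1 for each correct response and subtract 1 for each incorrect one) comes from the caption.

```python
def accumulated_score(reliability, correct):
    """Running (correct - incorrect) count over samples sorted by
    reliability from high to low."""
    order = sorted(range(len(reliability)),
                   key=lambda i: reliability[i], reverse=True)
    curve, score = [], 0
    for i in order:
        score += 1 if correct[i] else -1
        curve.append(score)
    return curve

# Toy data (made up for illustration): the two most reliable
# responses are correct, the two least reliable are wrong.
reliability = [0.9, 0.2, 0.7, 0.4]
correct = [True, False, True, False]

curve = accumulated_score(reliability, correct)
print(curve)  # [1, 2, 1, 0]
```

With a good reliability indicator the curve rises first and falls later, as in this toy case, because correct answers are concentrated among the high-reliability samples.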

Theorems & Definitions (1)

  • Theorem 1