Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models

Haoyi Song, Ruihan Ji, Naichen Shi, Fan Lai, Raed Al Kontar

TL;DR

This work tackles uncertainty quantification for large language models, where token probabilities and simple perturbations often fail to reflect true uncertainty. It introduces Inv-Entropy, a fully probabilistic measure derived from a dual random-walk model that links perturbed inputs to outputs via embeddings and similarity-based transitions, and uses bootstrapping to estimate $H(X \mid Y)$. The framework is augmented with GAAP, a genetic-algorithm-based perturbation method, and a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which assesses uncertainty without ground-truth correctness. Empirically, Inv-Entropy achieves state-of-the-art performance across multiple QA, knowledge, and math tasks on both black-box and gray-box LLMs, underscoring the framework’s flexibility and practical value for reliable AI deployment.

Abstract

Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic interpretation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.
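As a rough illustration of the phrase "transition probabilities defined by semantic similarity", the sketch below turns a set of embedding vectors into a row-stochastic transition matrix. It is not the paper's implementation: the choice of cosine similarity, the exponential weighting, and the names `cosine_similarity` and `similarity_transition_matrix` are all assumptions made for this example, and the abstract itself notes that the framework admits other embeddings and similarity metrics.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_transition_matrix(embeddings):
    """Build a row-stochastic matrix from pairwise similarities.

    Assumption for this sketch: entry (i, j) is proportional to
    exp(cosine_similarity(e_i, e_j)), so every weight is positive and
    more similar items receive a higher transition probability.
    """
    n = len(embeddings)
    matrix = []
    for i in range(n):
        weights = [
            math.exp(cosine_similarity(embeddings[i], embeddings[j]))
            for j in range(n)
        ]
        total = sum(weights)
        matrix.append([w / total for w in weights])
    return matrix


# Hypothetical 2-D embeddings: the first two are close, the third is far away.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
P = similarity_transition_matrix(toy_embeddings)
```

Each row of `P` sums to 1, so it can be read as one step of a Markov chain over the items; in the toy data, the transition from the first item to its near neighbour is more likely than the transition to the distant third item.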

Paper Structure

This paper contains 41 sections, 1 theorem, 28 equations, 5 figures, 11 tables, 1 algorithm.

Key Result

Lemma 2.1

Assume (1) $\hat{f}$ is twice differentiable, and both $\|\nabla \hat{f}(x)\|$ and $\|\nabla^2 \hat{f}(x)\|_{\texttt{op}}$ are bounded for all $x \in \mathbb{R}^d$, and (2) $\nabla \hat{f}(x_0) \neq 0$ and $\nabla f^{\star}(x_0) \neq 0$. Then, for sufficiently small $\sigma$, we have where $\theta(v_1, v_2) = \arccos\left(\frac{v_1^{\top} v_2}{\|v_1\| \|v_2\|}\right)$ denotes the angle between tw
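The angle $\theta(v_1, v_2)$ used in the lemma can be computed directly from its definition. The following is a minimal sketch using only the standard library; the function name `angle_between` and the clamping step are choices made for this example. The clamp guards against floating-point round-off pushing the cosine slightly outside $[-1, 1]$, and the zero-vector check mirrors the lemma's assumption that both gradients are non-zero.

```python
import math


def angle_between(v1, v2):
    """Return arccos(v1 . v2 / (||v1|| ||v2||)) in radians."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("the angle is undefined for a zero vector")
    cos_theta = dot / (norm1 * norm2)
    # Clamp to [-1, 1] so round-off cannot make acos raise a domain error.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)


orthogonal = angle_between([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors
parallel = angle_between([1.0, 0.0], [2.0, 0.0])     # same direction
opposite = angle_between([1.0, 0.0], [-1.0, 0.0])    # opposite direction
```

The angle depends only on direction, not magnitude, so scaling either vector by a positive constant leaves the result unchanged.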

Figures (5)

  • Figure 1: Toy example highlighting the importance of perturbations. The original question is from TriviaQA (Joshi et al., 2017), and the correct answer is “bras.” The responses are generated by ChatGPT-3.5-Turbo. Input perturbations reveal hidden variability that repeated sampling (i.e., replication) alone fails to capture, since replicated answers can be confidently wrong.
  • Figure 2: Left: Conceptual illustration of level sets of the ground truth $f^{\star}$. We perturb the input $x_0$ to $x_1, x_2, \ldots$ along a schematic isocontour such that $f^{\star}(x_0) = f^{\star}(x_1) = \cdots$. Right: Conceptual illustration of level sets of the model $\hat{f}$. The deviations of $\hat{f}(x_i)$ for $i \ge 1$ from $\hat{f}(x_0)$ reflect the model’s uncertainty around $x_0$.
  • Figure 3: Random-walk transitions underlying $P(X \mid Y) = \text{P}_y \text{P}_x$. Highlighted blue paths show two representative transitions (one through $k$ and one through $n$), each following $y_j \xrightarrow{\text{P}_y} y_k \xrightarrow{\text{LLM}} x_k \xrightarrow{\text{P}_x} x_i$.
  • Figure 4: Illustration of GAAP on a TriviaQA (Joshi et al., 2017) question.
  • Figure 5: AUROC of Inv-Entropy under different perturbation methods (GAAP or ChatGPT-based paraphrasing) and embedding functions, on both ChatGPT and LLaMA models.
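The composition in the Figure 3 caption, $P(X \mid Y) = \text{P}_y \text{P}_x$, can be illustrated with a small worked example: each path $y_j \to y_k \to x_k \to x_i$ contributes $\text{P}_y[j][k] \cdot \text{P}_x[k][i]$, and summing over $k$ is exactly a matrix product. The sketch below uses made-up 2×2 transition matrices and computes the Shannon entropy of each resulting row. The helper names `matmul` and `row_entropy` are hypothetical, and how the paper aggregates or bootstraps these quantities into Inv-Entropy is not specified in this summary, so that step is left out.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def row_entropy(row):
    """Shannon entropy (in nats) of one probability row; 0 * log 0 is taken as 0."""
    return -sum(p * math.log(p) for p in row if p > 0)


# Made-up row-stochastic matrices, for illustration only.
P_y = [[0.8, 0.2],
       [0.3, 0.7]]
P_x = [[0.9, 0.1],
       [0.5, 0.5]]

# Row j is the distribution over inputs x_i conditioned on output y_j.
P_x_given_y = matmul(P_y, P_x)
entropies = [row_entropy(row) for row in P_x_given_y]
```

With these numbers the first row of `P_x_given_y` is `[0.82, 0.18]` and the second is `[0.62, 0.38]`. The second row is closer to uniform and therefore has the larger entropy, which is the sense in which a more diverse set of inputs conditioned on an output signals higher uncertainty.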

Theorems & Definitions (1)

  • Lemma 2.1