Distance Is All You Need: Radial Dispersion for Uncertainty Estimation in Large Language Models

Manh Nguyen, Sunil Gupta, Hung Le

TL;DR

This paper tackles the challenge of reliably estimating uncertainty in open-ended LLM outputs. It proposes the Radial Dispersion Score (RDS), a simple, parameter-free metric computed from external embeddings that captures the spread of sampled generations on a unit sphere, along with a probability-weighted variant (RDS_w) that uses token probabilities when they are available. The authors show that RDS and RDS_w outperform a broad set of baselines across multiple datasets and models, and that the per-sample scores enable effective best-of-N selection and confidence-based filtering. The approach is model-agnostic, scalable, and robust to embedding choices and sampling budgets, offering a practical tool for reducing hallucinations and improving decision-making with LLMs. Its limitations are the need for an external encoder and for multiple samples per query, which matters most in black-box settings.

Abstract

Detecting when large language models (LLMs) are uncertain is critical for building reliable systems, yet existing methods are overly complicated, relying on brittle semantic clustering or internal states. We introduce Radial Dispersion Score (RDS), a simple, parameter-free, fully model-agnostic uncertainty metric that measures the radial dispersion of sampled generations in embedding space. A lightweight probability-weighted variant further incorporates the model's own token probabilities when available, outperforming nine strong baselines. Moreover, RDS naturally extends to per-sample scoring, enabling applications such as best-of-$N$ selection and confidence-based filtering. Across four challenging free-form QA datasets and multiple LLMs, our metrics achieve state-of-the-art hallucination detection and answer selection performance, while remaining robust and scalable with respect to sample size and embedding choice.
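The abstract describes the idea only in words, so the following is a minimal sketch of what a radial dispersion computation could look like. It assumes that the score is the sum of Euclidean distances from each unit-normalized embedding to their centroid, and that the weighted variant uses per-sample sequence probabilities as centroid and distance weights. Both choices are assumptions made for illustration, not the paper's stated definitions, and the function name `radial_dispersion` is hypothetical.

```python
import numpy as np

def radial_dispersion(embeddings, weights=None):
    """Illustrative dispersion score (assumed form, not the paper's exact definition).

    embeddings: (N, d) array of non-zero sentence embeddings, one per sampled answer.
    weights:    optional length-N non-negative weights, e.g. sequence probabilities.
    Returns (total_score, per_sample_distances).
    """
    x = np.asarray(embeddings, dtype=float)
    # Project every embedding onto the unit sphere.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    n = x.shape[0]
    if weights is None:
        w = np.full(n, 1.0 / n)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    # (Weighted) centroid of the sampled answers.
    centroid = w @ x
    # Per-sample radial distance from the centroid.
    per_sample = np.linalg.norm(x - centroid, axis=1)
    # Assumed aggregation: with uniform weights this is the plain sum of distances.
    total = n * float(w @ per_sample)
    return total, per_sample

# Toy 2-D "embeddings": one consistent answer set, one scattered answer set.
consistent = np.array([[1.0, 0.0], [0.99, 0.05], [0.98, -0.04]])
scattered = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

score_c, _ = radial_dispersion(consistent)
score_s, per_s = radial_dispersion(scattered)

# Best-of-N selection under this sketch: pick the sample closest to the centroid.
best = int(np.argmin(per_s))
```

In this sketch a larger total indicates higher uncertainty, and the per-sample distances double as a confidence signal: answers far from the centroid are candidates for filtering, while the closest one is a natural best-of-$N$ pick.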

Paper Structure

This paper contains 38 sections, 2 theorems, 24 equations, 3 figures, 5 tables.

Key Result

Proposition 1

Let $\{\mathbf{x}_i\}_{i=1}^N \subset \mathbb{R}^d$ be unit-norm embeddings ($\|\mathbf{x}_i\|_2 = 1$). Let the centered embeddings be $\mathbf{y}_i = \mathbf{x}_i - \bar{\mathbf{x}}$. Then: Equality in (2) holds if and only if all $\mathbf{x}_i$ are identical. The gap becomes larger as $\bar{\mathbf{x}} \to \mathbf{0}$.
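The role of $\bar{\mathbf{x}}$ in this statement can be made concrete with a standard identity for unit-norm vectors. This is a worked step that follows from the definitions above, using $\sum_i \mathbf{x}_i = N\bar{\mathbf{x}}$; it is offered as supporting arithmetic and should not be read as the proposition's inequality itself.

```latex
\sum_{i=1}^{N} \|\mathbf{y}_i\|_2^2
  = \sum_{i=1}^{N} \left( \|\mathbf{x}_i\|_2^2
      - 2\,\mathbf{x}_i^\top \bar{\mathbf{x}}
      + \|\bar{\mathbf{x}}\|_2^2 \right)
  = N - 2N\|\bar{\mathbf{x}}\|_2^2 + N\|\bar{\mathbf{x}}\|_2^2
  = N\left(1 - \|\bar{\mathbf{x}}\|_2^2\right)
```

The total squared spread is therefore zero exactly when $\|\bar{\mathbf{x}}\|_2 = 1$, which for unit vectors happens only if all $\mathbf{x}_i$ are identical, and it reaches its maximum $N$ when $\bar{\mathbf{x}} = \mathbf{0}$.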

Figures (3)

  • Figure 1: RDS vs EigenEmbed across three uncertainty regimes. (1) Collapsed: RDS $\approx$ EigenEmbed $\approx 0$. (2) Isotropic: RDS $\approx \sqrt{N}$, EigenEmbed $\approx 0.8$--$0.9$. (3) Bimodal: $\bar{\mathbf{x}} \to \mathbf{0}$, RDS $\geq \sqrt{N}$, EigenEmbed $\approx 1$ (maximum gap). RDS is more sensitive to high-uncertainty regimes with semantic diversity.
  • Figure 2: Ablation on the number of sampled responses $N$. Only top baselines are selected for illustration. Detailed results of all methods are provided in the appendix (app:sample-details).
  • Figure 3: Effect of the sentence embedding model. Results are reported on SVAMP (a) and GPQA (b) using $N{=}10$ sampled responses. Detailed results are provided in the appendix (app:embedding-details).
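The three regimes named in the Figure 1 caption (collapsed, isotropic, bimodal) can be sketched with synthetic unit vectors. The snippet below does not compute RDS or EigenEmbed, whose definitions are not given here. It only reports two quantities that follow directly from the setup: the centroid norm $\|\bar{\mathbf{x}}\|_2$ and the root of the total squared spread, $\sqrt{\sum_i \|\mathbf{x}_i - \bar{\mathbf{x}}\|_2^2}$. The sample count, the dimension, and the use of two exactly opposite clusters for the bimodal case are illustrative assumptions.

```python
import numpy as np

def unit_rows(a):
    """Normalize each row to unit Euclidean norm."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)

def centroid_norm_and_spread(x):
    """Return (||mean||, sqrt(sum_i ||x_i - mean||^2)) for the rows of x."""
    centroid = x.mean(axis=0)
    spread = np.sqrt(((x - centroid) ** 2).sum())
    return float(np.linalg.norm(centroid)), float(spread)

rng = np.random.default_rng(0)
n, d = 10, 384  # hypothetical sample count and embedding dimension

direction = unit_rows(rng.normal(size=(1, d)))

regimes = {
    # All samples share one embedding: no semantic diversity.
    "collapsed": np.repeat(direction, n, axis=0),
    # Independent random directions on the sphere.
    "isotropic": unit_rows(rng.normal(size=(n, d))),
    # Two opposite clusters of equal size (idealized bimodal case).
    "bimodal": np.vstack([np.repeat(direction, n // 2, axis=0),
                          np.repeat(-direction, n // 2, axis=0)]),
}

results = {name: centroid_norm_and_spread(x) for name, x in regimes.items()}
for name, (c_norm, spread) in results.items():
    print(f"{name:9s} centroid norm = {c_norm:.3f}, root squared spread = {spread:.3f}")
```

For unit-norm rows the spread reported here equals $\sqrt{N(1 - \|\bar{\mathbf{x}}\|_2^2)}$, so it vanishes in the collapsed case and reaches $\sqrt{N}$ when the centroid is at the origin, as in the idealized bimodal case.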

Theorems & Definitions (6)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • proof
  • proof