
REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy

Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, Tagyoung Chung

TL;DR

REAL sampling addresses the open-ended generation challenge of preserving factuality while maintaining diversity by adapting the nucleus threshold using a tiny Token-level Hallucination Forecasting (THF) model. It introduces a parameterization of the entropy decay curve across model sizes, estimates asymptotic entropy $e_c^{AE}$, and derives a residual entropy $d_c^{RE}$ to gauge hallucination hazard, which is converted into a context-aware top-$p$ threshold $\hat{t}_c^p = \exp(-\hat{d}_c^{RE}/T)$. The approach provides a theoretical bound on the decoding threshold and demonstrates substantial improvements on FactualityPrompts for 7B LLMs, with additional gains when combined with contrastive decoding, plus supportive unsupervised signals for hallucination detection. These results suggest that unsupervised, size-aware entropy forecasting can meaningfully enhance factuality and diversity in open-ended generation with broad applicability across LLM families. The work offers practical decoding guidance and points to future directions for integrating THF with more decoding strategies and larger models.
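As a rough illustration of the threshold formula above, the following minimal sketch maps a predicted residual entropy to a top-$p$ threshold and applies standard nucleus truncation. The function names, the clipping of negative predictions to zero, and the NumPy-based filtering are assumptions made for this example; they are not taken from the authors' implementation.

```python
import numpy as np

def real_top_p_threshold(residual_entropy: float, scale_T: float = 1.0) -> float:
    """Map a predicted residual entropy d to a top-p threshold exp(-d / T)."""
    # Assumption: negative predictions are clipped to 0 so the threshold stays in (0, 1].
    d = max(float(residual_entropy), 0.0)
    return float(np.exp(-d / scale_T))

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of most-probable tokens whose mass reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]          # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    # Number of tokens needed for the cumulative mass to reach p (at least one).
    cutoff = int(np.searchsorted(cumulative, p, side="left")) + 1
    cutoff = min(cutoff, len(probs))
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()

def real_sampling_step(probs: np.ndarray, residual_entropy: float,
                       scale_T: float, rng: np.random.Generator) -> int:
    """Sample one token id using the context-dependent threshold."""
    p = real_top_p_threshold(residual_entropy, scale_T)
    return int(rng.choice(len(probs), p=top_p_filter(probs, p)))
```

Under these assumptions, a residual entropy of zero gives a threshold of 1 (sampling from the full distribution), and a larger residual entropy shrinks the nucleus toward greedy decoding. A larger $T$ makes the threshold less sensitive to the predicted hazard.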

Abstract

Decoding methods for large language models (LLMs) usually struggle with the tradeoff between ensuring factuality and maintaining diversity. For example, a higher $p$ threshold in nucleus (top-$p$) sampling increases diversity but decreases factuality, and vice versa. In this paper, we propose REAL (Residual Entropy from Asymptotic Line) sampling, a decoding method that achieves improved factuality and diversity over nucleus sampling by predicting an adaptive threshold of $p$. Specifically, REAL sampling predicts the step-wise likelihood that an LLM will hallucinate, and lowers the $p$ threshold when the LLM is likely to hallucinate. Otherwise, REAL sampling increases the $p$ threshold to boost diversity. To predict the step-wise hallucination likelihood without supervision, we construct a Token-level Hallucination Forecasting (THF) model to predict the asymptotic entropy (i.e., inherent uncertainty) of the next token by extrapolating the next-token entropies from a series of LLMs with different sizes. If an LLM's entropy is higher than the asymptotic entropy (i.e., the LLM is more uncertain than it should be), the THF model predicts a high hallucination hazard, which leads to a lower $p$ threshold in REAL sampling. On the FactualityPrompts benchmark, we demonstrate that REAL sampling based on a 70M THF model can substantially improve the factuality and diversity of 7B LLMs simultaneously, as judged by both retrieval-based metrics and human evaluation. When combined with contrastive decoding, REAL sampling outperforms 9 sampling methods and generates texts that are more factual than greedy sampling and more diverse than nucleus sampling with $p=0.5$. Furthermore, the predicted asymptotic entropy is also a useful unsupervised signal for hallucination detection tasks.
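To make the idea of extrapolating entropies across model sizes concrete, here is a minimal sketch that fits a decay curve to per-size entropies for a single context and reads off its limit. The curve family $e(s) = a + b\,\exp(-k \log_{10}(s / s_{\min}))$, the grid search over $k$, and the function name are all assumptions chosen for illustration. The paper's own parameterization is not reproduced here, and the paper trains a THF model to predict the curve parameters rather than fitting them directly per context.

```python
import numpy as np

def fit_asymptotic_entropy(model_sizes, entropies, decay_rates=None):
    """Fit e(s) = a + b * exp(-k * x) with x = log10(s / smallest size).

    Returns (a, b, k). Under this assumed curve family, `a` is the limit as
    the model size grows without bound, i.e. the asymptotic entropy.
    """
    sizes = np.asarray(model_sizes, dtype=float)
    e = np.asarray(entropies, dtype=float)
    x = np.log10(sizes / sizes.min())
    if decay_rates is None:
        decay_rates = np.linspace(0.1, 3.0, 30)  # hypothetical search grid for k
    best = None
    for k in decay_rates:
        # For a fixed k the model is linear in (a, b), so solve by least squares.
        design = np.column_stack([np.ones_like(x), np.exp(-k * x)])
        coef, *_ = np.linalg.lstsq(design, e, rcond=None)
        err = float(np.sum((design @ coef - e) ** 2))
        if best is None or err < best[0]:
            best = (err, float(coef[0]), float(coef[1]), float(k))
    return best[1], best[2], best[3]
```

Given entropies measured from several models of one family on the same context, the returned `a` plays the role of the asymptotic entropy, and the gap between a given LLM's entropy and `a` corresponds to the residual entropy described above. The sketch does not constrain `b` to be non-negative, so noisy inputs can produce a curve that rises with model size.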

Paper Structure

This paper contains 37 sections, 1 theorem, 12 equations, 16 figures, and 5 tables.

Key Result

Theorem 3.1

If the residual entropy is estimated accurately (i.e., $\hat{d}_c^{RE}=d_c^{RE}$), and there is an ideal threshold $g_c^p$ such that the distribution of the top tokens above the threshold is ideal, then

Figures (16)

  • Figure 1: (a) For the factual question, only a few next tokens are correct, but the target LLM assigns high probabilities to many tokens, so our THF model predicts that the next token from the LLM is likely to be incorrect if a large $p$ threshold is used. (b) At the beginning of a sentence, many tokens could be used, so our THF model predicts that sampling from more tokens increases the diversity without hurting the factuality.
  • Figure 2: The entropies of Pythia's next-token distributions versus model size on a logarithmic scale. The entropies are averaged across all tokens in a Wikipedia subset. The blue entropy decay curve plots the actual entropies from the Pythia LMs; the green curve plots the entropies predicted by our THF model.
  • Figure 3: Given the input context, LLMs of different sizes generate next-token distributions. By extrapolating the entropy curve using a tiny THF model, we estimate the asymptotic entropy, i.e., the entropy of an imaginary LLM of infinite size (the inherent uncertainty of the next token), and compute the residual entropy as a measure of the hallucination hazard. (a) The LLM's entropy is much higher than the asymptotic entropy. This implies that the LLM is more uncertain than it should be and is thus likely to hallucinate next. (b) The LLM's high entropy is fine because the next token is inherently uncertain.
  • Figure 4: The architecture and the training of the THF model. We use the THF model to predict the parameters of the entropy decay curves and we train the THF model by minimizing the distances between the predicted entropy curves and the empirical entropies from the LLM family.
  • Figure 5: Open-ended text generation performance comparison between REAL sampling and state-of-the-art unsupervised thresholding methods, including top-$p$ [holtzman2019curious], eta [hewitt2022truncation], and typical [meister2022typical] sampling. The factuality and diversity are evaluated using the FactualityPrompts benchmark from [lee2022factuality]. We also conduct an ablation study and compare REAL sampling with distribution modification methods, including temperature sampling [ficler2017controlling], contrastive search (CS) [su2022contrastive], contrastive decoding (CD) [li2022contrastive], and DoLa [chuang2023dola]. See more comparisons in the figures labeled fig:K_abaltion, fig:decay_function, and fig:comp_gen_more.
  • ...and 11 more figures
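
The two cases in the Figure 3 caption can be illustrated with a small sketch that computes a next-token entropy and subtracts an asymptotic entropy estimate. Treating the residual entropy as a plain difference, measuring entropy in nats, and passing the asymptotic entropy in as a given number are assumptions made for this example; in the paper these quantities are forecast by the THF model.

```python
import numpy as np

def shannon_entropy(probs) -> float:
    """Entropy (in nats) of a next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # terms with zero probability contribute nothing
    return float(-np.sum(p * np.log(p)))

def residual_entropy(llm_probs, asymptotic_entropy: float) -> float:
    """LLM entropy minus the estimated asymptotic entropy.

    Assumption: the residual is a plain difference; the paper's exact
    definition should be checked before relying on this.
    """
    return shannon_entropy(llm_probs) - asymptotic_entropy
```

With the same flat LLM distribution, a low asymptotic entropy (case a, a factual question with few correct answers) yields a large residual, while a high asymptotic entropy (case b, the start of a sentence) yields a residual near zero.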

Theorems & Definitions (2)

  • Theorem 3.1
  • proof