REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy
Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, Tagyoung Chung
TL;DR
REAL sampling targets the tradeoff between factuality and diversity in open-ended generation by adapting the nucleus (top-$p$) threshold at each step with a small Token-level Hallucination Forecasting (THF) model. The method parameterizes how next-token entropy decays as model size grows, estimates the asymptotic entropy $e_c^{AE}$, and derives a residual entropy $d_c^{RE}$ that gauges the hallucination hazard. The predicted residual entropy is then converted into a context-aware top-$p$ threshold $\hat{t}_c^p = \exp(-\hat{d}_c^{RE}/T)$. The approach provides a theoretical bound on the decoding threshold and, with a 70M THF model, substantially improves both factuality and diversity for 7B LLMs on FactualityPrompts, with additional gains when combined with contrastive decoding. The predicted asymptotic entropy also serves as an unsupervised signal for hallucination detection. These results suggest that unsupervised, size-aware entropy forecasting can improve factuality and diversity together, and they point to future work on integrating THF with more decoding strategies and larger models.
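The threshold formula above can be illustrated with a minimal sketch. It assumes the residual entropy for the current context has already been predicted, treats $T$ as a positive scaling hyperparameter, and clamps negative residuals to zero so the threshold never exceeds 1. The function names `real_threshold` and `top_p_filter` are hypothetical and are not taken from the authors' code.

```python
import numpy as np


def real_threshold(residual_entropy, temperature=1.0):
    """Map a predicted residual entropy to a top-p threshold, exp(-d / T).

    Assumption: negative residuals are clamped to 0 so the result is at most 1.
    """
    d = max(float(residual_entropy), 0.0)
    return float(np.exp(-d / temperature))


def top_p_filter(probs, p):
    """Standard nucleus filtering: keep the smallest set of most likely
    tokens whose cumulative probability reaches p, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs)               # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    # Index of the first token at which the cumulative mass reaches p.
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


# Toy next-token distribution over four tokens (illustrative values only).
probs = [0.5, 0.3, 0.15, 0.05]
cautious = top_p_filter(probs, real_threshold(1.0))   # high hazard, small nucleus
diverse = top_p_filter(probs, real_threshold(0.2))    # low hazard, larger nucleus
```

A larger residual entropy yields a smaller threshold, so fewer candidate tokens survive the filter; a residual near zero leaves the threshold near 1 and the distribution almost untouched.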
Abstract
Decoding methods for large language models (LLMs) usually struggle with the tradeoff between ensuring factuality and maintaining diversity. For example, a higher $p$ threshold in nucleus (top-$p$) sampling increases diversity but decreases factuality, and vice versa. In this paper, we propose REAL (Residual Entropy from Asymptotic Line) sampling, a decoding method that achieves improved factuality and diversity over nucleus sampling by predicting an adaptive threshold of $p$. Specifically, REAL sampling predicts the step-wise likelihood that an LLM will hallucinate and lowers the $p$ threshold when the LLM is likely to hallucinate. Otherwise, REAL sampling increases the $p$ threshold to boost diversity. To predict the step-wise hallucination likelihood without supervision, we construct a Token-level Hallucination Forecasting (THF) model that predicts the asymptotic entropy (i.e., inherent uncertainty) of the next token by extrapolating the next-token entropies from a series of LLMs with different sizes. If an LLM's entropy is higher than the asymptotic entropy (i.e., the LLM is more uncertain than it should be), the THF model predicts a high hallucination hazard, which leads to a lower $p$ threshold in REAL sampling. On the FactualityPrompts benchmark, we demonstrate that REAL sampling based on a 70M THF model can substantially improve the factuality and diversity of 7B LLMs simultaneously, as judged by both retrieval-based metrics and human evaluation. When combined with contrastive decoding, REAL sampling outperforms 9 sampling methods and generates texts that are more factual than greedy sampling and more diverse than nucleus sampling with $p=0.5$. Furthermore, the predicted asymptotic entropy is also a useful unsupervised signal for hallucination detection tasks.
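The extrapolation idea described in the abstract can be sketched as a toy curve fit. This section does not give the paper's actual parameterization of the entropy decay curve, so the sketch assumes a simple power-law form $e(s) = e^{AE} + a \cdot s^{-b}$ purely for illustration. The model sizes, entropy values, and function names below are all hypothetical. In the paper, a THF model predicts these quantities from the context rather than fitting a curve at decoding time.

```python
import numpy as np


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                              # skip zero-probability tokens
    return float(-(p * np.log(p)).sum())


def fit_asymptotic_entropy(sizes, entropies, exponents=None):
    """Fit e(s) = e_ae + a * s**(-b) and return e_ae, the entropy as s grows.

    Assumed functional form, not the paper's. The exponent b is chosen by
    grid search; e_ae and a are solved by linear least squares for each b.
    """
    s = np.asarray(sizes, dtype=float)
    e = np.asarray(entropies, dtype=float)
    if exponents is None:
        exponents = np.linspace(0.1, 2.0, 20)
    best_err, best_e_ae = None, None
    for b in exponents:
        design = np.column_stack([np.ones_like(s), s ** (-b)])
        coef, *_ = np.linalg.lstsq(design, e, rcond=None)
        err = float(((design @ coef - e) ** 2).sum())
        if best_err is None or err < best_err:
            best_err, best_e_ae = err, float(coef[0])
    return max(best_e_ae, 0.0)                # entropy cannot be negative


def residual_entropy(llm_entropy, asymptotic_entropy):
    """How much more uncertain the LLM is than the extrapolated limit."""
    return max(float(llm_entropy) - float(asymptotic_entropy), 0.0)


# Hypothetical model sizes (billions of parameters) and synthetic entropies
# generated from the assumed curve with e_ae = 1.0, a = 0.5, b = 0.5.
sizes = [0.07, 0.16, 0.41, 1.0, 2.8, 6.9]
entropies = [1.0 + 0.5 * s ** -0.5 for s in sizes]
e_ae = fit_asymptotic_entropy(sizes, entropies)
d_re = residual_entropy(entropies[-1], e_ae)  # residual for the largest model
```

A positive residual for the generating LLM signals that it is more uncertain than the extrapolated limit, which in REAL sampling corresponds to a higher hallucination hazard and a lower $p$ threshold.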
