Distinguishing the Knowable from the Unknowable with Language Models

Gustaf Ahdritz; Tian Qin; Nikhil Vyas; Boaz Barak; Benjamin L. Edelman

TL;DR

This work investigates distinguishing epistemic from aleatoric uncertainty in unconstrained text produced by large language models. It uses a larger model as a ground-truth proxy and trains lightweight probes on frozen smaller models to predict when the larger model will be confident at the next-token level, complemented by a fully unsupervised In-Context Learning Test (ICLT) to detect uncertainty types. The supervised approach achieves high AUROC (often >0.9) across model pairings and data domains, with transfer to out-of-domain sets, while the unsupervised ICLT approach yields non-trivial accuracy and reveals the importance of prompt structure and separator tokens. These results suggest that LLMs contain internal representations of different uncertainty types that could be leveraged to improve confidence estimation and reduce hallucinations in practical deployments.
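As a rough illustration of how "the larger model will be confident" can be turned into a binary label, the sketch below thresholds the large model's next-token predictive entropy, in the spirit of the Figure 1 caption. The function names and the 0.2-bit threshold are assumptions made for this example, not values taken from the paper.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def confident_labels(large_model_dists, threshold_bits=0.2):
    """Label each token position 1 if the large model's entropy is near zero.

    `threshold_bits` is a hypothetical cutoff chosen for illustration only.
    """
    return [int(entropy_bits(dist) < threshold_bits) for dist in large_model_dists]


# Toy next-token distributions standing in for large-model outputs.
dists = [[1.0, 0.0], [0.5, 0.5], [0.99, 0.01]]
labels = confident_labels(dists)
```

The resulting labels are what a probe on the smaller model would be trained to predict, without access to the large model at inference time.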

Abstract

We study the feasibility of identifying epistemic uncertainty (reflecting a lack of knowledge), as opposed to aleatoric uncertainty (reflecting entropy in the underlying distribution), in the outputs of large language models (LLMs) over free-form text. In the absence of ground-truth probabilities, we explore a setting where, in order to (approximately) disentangle a given LLM's uncertainty, a significantly larger model stands in as a proxy for the ground truth. We show that small linear probes trained on the embeddings of frozen, pretrained models accurately predict when larger models will be more confident at the token level and that probes trained on one text domain generalize to others. Going further, we propose a fully unsupervised method that achieves non-trivial accuracy on the same task. Taken together, we interpret these results as evidence that LLMs naturally contain internal representations of different types of uncertainty that could potentially be leveraged to devise more informative indicators of model confidence in diverse practical settings.
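To make the idea of a small linear probe on frozen embeddings concrete, here is a minimal sketch: logistic regression fit by gradient descent and scored with AUROC on a held-out split. The features are synthetic stand-ins for frozen small-model embeddings, and every name and hyperparameter here is an assumption for illustration, not the authors' training setup.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Probe output: estimated probability that the large model is confident."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(features, labels, lr=0.5, epochs=300):
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = probe_score(w, b, x) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic "embeddings": only the first coordinate carries the label signal.
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(200)]
features = [[rng.gauss(2.0 * y - 1.0, 0.3), rng.gauss(0, 1), rng.gauss(0, 1)]
            for y in labels]

w, b = train_linear_probe(features[:150], labels[:150])
test_scores = [probe_score(w, b, x) for x in features[150:]]
test_auroc = auroc(test_scores, labels[150:])
```

In a real setting the feature vectors would be activations read from the frozen small model, and the labels would come from the large model's entropy.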

Paper Structure (41 sections, 4 equations, 18 figures, 14 tables)


Introduction
Related work
Identifying epistemic uncertainty
In-context learning
Setup
High-level task description
Models
Datasets
Baselines
Supervised experiments
Training details
Binary classification (with gap)
Binary classification (without gap)
Unsupervised experiments
Synthetic proof of concept
...and 26 more sections

Figures (18)

Figure 1: Simplified overview of our classification task and the supervised method. Left: The supervised uncertainty classification task. A dataset of labeled prompts is created by running a large LLM on existing text and thresholding its predictive entropy near zero. Right: We train probes on activations of a smaller target model to predict the resulting labels without access to the large model.
Figure 2: The unsupervised ICLT method. Left: Illustration of the unsupervised In-Context Learning Test (ICLT) method. Given a prompt, we first use the small model to produce several next-token candidates. For each candidate, we create a new prompt of the form $\langle\texttt{orig-prompt}\rangle \langle\texttt{candidate}\rangle \langle\texttt{orig-prompt}\rangle$, with document separator tokens before each copy of $\langle\texttt{orig-prompt}\rangle$. We feed these repetition prompts back into the small model and obtain new next-token distributions. Finally, we use the minimum of the entropies of these distributions as a predictor of the uncertainty of the large model (see the classification task defined in Figure 1). In particular, we expect that when the small model's uncertainty is primarily epistemic, it will be more liable to repeat the completion provided in its context, effectively updating on this "new information".
Figure 3: High-level classification results for our supervised and unsupervised methods. ROC curves for linear probes trained with the model pairing LLaMA 7B / LLaMA 65B, evaluated both in (left) and out (middle) of the distribution of the probes' Wikipedia training data. All classifier probes are evaluated on balanced test sets. Right: ROC curve for unsupervised ICLT method with the model pairing LLaMA 7B / LLaMA 65B evaluated on the Wikipedia test set.
Figure 4: Given a knowledgeable enough "large" model, tokens where a small model is unconfident while the large model is confident can be thought of as instances of epistemic uncertainty on the part of the small model. Snippets from Wikipedia articles with tokens highlighted wherever the conditional predictive entropies of a pair of small and large language models (LLaMAs 7B and 30B) differ. For tokens where the entropy difference is larger than 2 bits, we also display the small model's top 3 token predictions in parentheses. Qualitatively, we observe that tokens at which the large model has near-zero entropy but the small one does not tend to be "epistemic" in nature, corresponding to e.g. dates, people, and specific technical vocabulary. See the appendix for an autoregressive version of the same.
Figure 5: Small model entropy and large model entropy are heavily correlated. Heatmaps (with log-scale color schemes) of the entropies of token probabilities output by smaller and larger language models. Across model types and datasets, these values are heavily correlated. To rule out that our heads learn to rely on this fact, we train separate heads on tokens from narrow bands of the small model's predictive entropy (example band highlighted in green) within which the correlation is negligible.
...and 13 more figures
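The ICLT procedure described in the Figure 2 caption can be sketched as follows. This is a toy illustration under stated assumptions: `toy_model` is a hypothetical stand-in for the small language model (it copies an in-context answer only for the "factual" prompt), `SEP` is a placeholder separator token, and the number of candidates `k` is arbitrary.

```python
import math

SEP = "<sep>"  # hypothetical document separator token


def entropy_bits(dist):
    """Shannon entropy, in bits, of a token -> probability mapping."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0.0)


def iclt_score(next_token_dist, prompt, k=2):
    """Minimum entropy over repetition prompts built from the top-k candidates."""
    base = next_token_dist(prompt)
    candidates = sorted(base, key=base.get, reverse=True)[:k]
    entropies = []
    for cand in candidates:
        # <sep> orig-prompt candidate <sep> orig-prompt
        repeated = [SEP] + prompt + [cand] + [SEP] + prompt
        entropies.append(entropy_bits(next_token_dist(repeated)))
    return min(entropies)


def toy_model(tokens):
    """Toy small model: uncertain by default, copies context on the factual prompt."""
    options = {"is": ["Paris", "Rome"], "flip:": ["heads", "tails"]}[tokens[-1]]
    seen = [t for t in tokens[:-1] if t in options]
    if tokens[-1] == "is" and seen:
        # Epistemic case: the model updates on the completion given in context.
        other = [o for o in options if o != seen[-1]][0]
        return {seen[-1]: 0.98, other: 0.02}
    # Aleatoric case (or no hint in context): stays at a coin flip.
    return {o: 0.5 for o in options}


epistemic_score = iclt_score(toy_model, ["The", "capital", "is"])
aleatoric_score = iclt_score(toy_model, ["coin", "flip:"])
```

In this toy setup the factual prompt yields a much lower minimum entropy than the coin-flip prompt, which is the signal the method uses to guess that the large model would be confident.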
