Investigating Human-Aligned Large Language Model Uncertainty

Kyle Moore, Jesse Roberts, Daryl Watson, Pamela Wisniewski

TL;DR

This work addresses aligning LLM uncertainty with human uncertainty to improve trust and safety in the presence of model hallucinations. It surveys and compares eight uncertainty measures across decoder-only LLMs using non-factual human survey data, and introduces nucleus size (NS), top-k entropy (KE), and choice entropy (CE) as new tools for alignment. A key finding is that combinations of measures can approximate human uncertainty with reduced dependence on model size, with KE often offering the strongest human-alignment among single measures. The results have practical implications for reporting and utilizing uncertainty in human-AI interaction, suggesting that hybrid uncertainty representations may enhance robustness and user trust in real-world applications.
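As a rough illustration of two of the measures named above, the sketch below computes a top-k entropy and a nucleus size from a next-token probability distribution. The summary does not give the paper's exact definitions, so everything here is an assumption: the function names, the renormalization of the top-k probabilities, the use of bits, and the threshold `top_p=0.9` are all hypothetical choices made for illustration.

```python
import math


def top_k_entropy(probs, k):
    """Shannon entropy in bits over the k most probable tokens.

    Assumption: the top-k probabilities are renormalized to sum to 1
    before the entropy is taken. The paper may define KE differently.
    """
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    return -sum((p / total) * math.log2(p / total) for p in top if p > 0)


def nucleus_size(probs, top_p=0.9):
    """Size of the smallest set of tokens whose cumulative probability
    reaches top_p (the candidate set used in nucleus sampling).

    Assumption: top_p=0.9 is an illustrative threshold, not the paper's.
    """
    cumulative = 0.0
    for size, p in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= top_p:
            return size
    return len(probs)


# A peaked distribution (confident model) versus a flat one (uncertain model).
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]

print(top_k_entropy(peaked, k=4), nucleus_size(peaked))
print(top_k_entropy(flat, k=4), nucleus_size(flat))
```

Under these assumptions, the flat distribution yields the larger value for both measures (2 bits of entropy and a nucleus of 4 tokens), which matches the intuition that a model spreading probability across many candidates is more uncertain.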

Abstract

Recent work has sought to quantify large language model uncertainty to facilitate model control and modulate user trust. Previous works focus on measures of uncertainty that are theoretically grounded or reflect the average overt behavior of the model. In this work, we investigate a variety of uncertainty measures in order to identify those that correlate with human group-level uncertainty. We find that Bayesian measures and a variation on entropy measures, top-k entropy, tend to agree with human behavior as a function of model size. Some strong measures decrease in human-similarity as model size grows, but multiple linear regression shows that combining multiple uncertainty measures provides comparable human-alignment with reduced size-dependency.

Paper Structure

This paper contains 19 sections, 4 figures.

Figures (4)

  • Figure 1: Prompt to measure presence/absence belief.
  • Figure 2: Correlation between uncertainty in human response data and LLM uncertainty across all uncertainty measures. Measures are ordered by mean correlation across models.
  • Figure 3: Correlation between measure human-similarity and model size across all models. Measures are ordered by correlation with model size.
  • Figure 4: Accuracy of linear regression models trained on LLM-measured uncertainty to predict human uncertainty. Top: 3-fold cross-validation; the dark green background bar indicates the mean correlation across models. Models are ordered by mean correlation. Bottom: Results when the model is trained and tested on the entire dataset.