SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?

Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Seong Joon Oh, Sinead Williamson

TL;DR

SelfReflect introduces an information-theoretic metric to evaluate whether a single string can faithfully summarize an LLM's internal uncertainty distribution over answers. It shows that modern LLMs struggle to reveal their internal uncertainties through reasoning or finetuning alone, but that sampling multiple outputs and feeding them back for summarization enables faithful self-reflection. The metric uses predictive sufficiency concepts and a masked-token task evaluated by a judge LLM via 1-Wasserstein distance, and is validated across open-ended and multiple-choice datasets with human alignment benchmarks. The work highlights a viable path forward for communicating LLM uncertainties and suggests sampling-based guidance as a robust approach for practical, trustworthy interactions and future uncertainty quantification frameworks.

Abstract

The common approach to communicating a large language model's (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how likely they are. To test whether LLMs possess this capability, we develop the SelfReflect metric, an information-theoretic distance between a given summary and a distribution over answers. In interventional and human studies, we find that SelfReflect detects even slight deviations, yielding a fine-grained measure of faithfulness between a summary string and an LLM's actual internal distribution over answers. With SelfReflect, we make a resounding negative observation: modern LLMs are, across the board, incapable of revealing what they are uncertain about, whether through reasoning, chains of thought, or explicit finetuning. However, we do find that LLMs are able to generate faithful summaries of their uncertainties if we help them by sampling multiple outputs and feeding them back into the context. This simple approach sheds light on a universal way of communicating LLM uncertainties, whose future development the SelfReflect score enables.
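The sample-then-summarize idea in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for whatever LLM sampling interface is used, and the prompt wording and the default of 10 samples are made up for the example.

```python
from typing import Callable, List


def build_summary_prompt(question: str, samples: List[str]) -> str:
    """Place the sampled answers back into the context (prompt wording is an assumption)."""
    listed = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(samples))
    return (
        f"Question: {question}\n"
        f"Your sampled answers:\n{listed}\n"
        "Summarize every option you consider possible and how likely each is."
    )


def sample_and_summarize(
    question: str,
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> one sampled completion
    n_samples: int = 10,             # illustrative default, not taken from the paper
) -> str:
    # Step 1: draw several independent answers from the model.
    samples = [generate(question) for _ in range(n_samples)]
    # Step 2: feed them back and ask the same model for a single summary string.
    return generate(build_summary_prompt(question, samples))
```

Any sampling backend can be plugged in as `generate`, provided it returns a fresh stochastic completion on each call; with greedy decoding all samples would coincide and the summary would carry no uncertainty information.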

Paper Structure

This paper contains 49 sections, 6 theorems, 21 equations, 17 figures, 16 tables.

Key Result

proposition 1

(Connection to Predictive Sufficiency) For an ideal summary $S$ of answers $A^{(1:N)}$, [...] Intuitively, the ideal summary $S$ is a predictive-sufficient statistic of the answers $A^{(1:N)}$ for $B$.

Figures (17)

  • Figure 1: LLMs have internal answer distributions about user queries. Rather than just sampling an output, possibly combined with a percentage, LLMs should generate a string that is self-reflective of their internal distribution, summarizing all possibilities and which they find the most likely.
  • Figure 2: Graphical model for the sufficiency that SelfReflect quantifies.
  • Figure 3: To test whether a summary string $s$ contains the same information as a set of samples $a^{(1:N)}$, SelfReflect prompts an LLM twice. First, it provides the summary as context; next, it provides the concatenated samples. SelfReflect then compares the resulting distributions via a masked-out task.
  • Figure 4: How often Qwen2.5 72B Instruct answer distributions span multiple possibilities vs how often their CoT summaries do.
  • Figure 5: The graphical model for the setting of the SelfReflect metric. The figure is reproduced from Figure 2 of the main text to make the formalization that follows easier to read.
  • ...and 12 more figures
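
The comparison described in Figure 3 can be illustrated with a small sketch. It assumes the judge LLM's predictions for each masked-out word are already available as token-probability dictionaries, once conditioned on the summary and once on the concatenated samples. It further assumes a 0/1 ground metric between tokens, under which the 1-Wasserstein distance equals the total variation distance; the ground metric actually used by SelfReflect is not stated here. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

TokenDist = Dict[str, float]  # token -> probability for one masked-out position


def wasserstein1_discrete(p: TokenDist, q: TokenDist) -> float:
    """1-Wasserstein distance under a 0/1 ground metric (assumption).

    With cost 1 for moving mass between any two distinct tokens, the optimal
    transport cost equals the total variation distance: 0.5 * sum |p - q|.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)


def average_masked_distance(pairs: List[Tuple[TokenDist, TokenDist]]) -> float:
    """Average the distance over masked positions (lower means a more faithful summary).

    Each pair holds the judge's distribution given the summary and the
    distribution given the samples, for the same masked-out word.
    """
    if not pairs:
        raise ValueError("need at least one masked position")
    return sum(wasserstein1_discrete(p, q) for p, q in pairs) / len(pairs)
```

A score of 0 means the judge predicts the masked words identically from the summary and from the samples; a score of 1 means the two predictions never overlap.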

Theorems & Definitions (13)

  • definition 1
  • proposition 1
  • proposition 2
  • definition 2
  • definition 3
  • lemma 1
  • proof
  • theorem 1
  • proof
  • theorem 2
  • ...and 3 more