SelfReflect: Can LLMs Communicate Their Internal Answer Distribution?
Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Seong Joon Oh, Sinead Williamson
TL;DR
SelfReflect is an information-theoretic metric for evaluating whether a single summary string faithfully describes an LLM's internal distribution over answers. The paper shows that modern LLMs struggle to reveal their internal uncertainties through reasoning or finetuning alone, but that sampling multiple outputs and feeding them back for summarization enables faithful self-reflection. The metric builds on the concept of predictive sufficiency and uses a masked-token task in which a judge LLM's predictions are compared via the 1-Wasserstein distance. It is validated on open-ended and multiple-choice datasets and checked for alignment with human judgments. The work points to sampling-based summarization as a practical way to communicate LLM uncertainties and offers the metric as a tool for developing future uncertainty quantification methods.
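As a rough illustration of the distance mentioned above, the sketch below computes a 1-Wasserstein distance between two categorical distributions over candidate tokens. It assumes a 0/1 ground metric between tokens, in which case the 1-Wasserstein distance reduces to the total variation distance; the ground metric and aggregation actually used by SelfReflect are not stated here, and the names `w1_discrete` and `masked_token_gap` are hypothetical.

```python
from typing import Dict, List


def w1_discrete(p: Dict[str, float], q: Dict[str, float]) -> float:
    """1-Wasserstein distance between two categorical distributions.

    ASSUMPTION: tokens are compared with a 0/1 ground metric, so the
    distance equals total variation, i.e. half the L1 difference.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)


def masked_token_gap(
    preds_from_summary: List[Dict[str, float]],
    preds_from_samples: List[Dict[str, float]],
) -> float:
    """Average the distance over masked positions (hypothetical aggregation).

    Each list entry is a judge's predicted token distribution for one
    masked position, conditioned on the summary or on the raw samples.
    """
    if len(preds_from_summary) != len(preds_from_samples):
        raise ValueError("Both inputs must cover the same masked positions.")
    if not preds_from_summary:
        raise ValueError("At least one masked position is required.")
    distances = [
        w1_discrete(p, q) for p, q in zip(preds_from_summary, preds_from_samples)
    ]
    return sum(distances) / len(distances)
```

A lower value would mean that the judge predicts masked tokens similarly whether it sees the summary or the underlying answers, which is the intuition behind treating the summary as sufficient.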
Abstract
The common approach to communicating a large language model's (LLM) uncertainty is to add a percentage number or a hedging word to its response. But is this all we can do? Instead of generating a single answer and then hedging it, an LLM that is fully transparent to the user needs to be able to reflect on its internal belief distribution and output a summary of all options it deems possible, and how likely they are. To test whether LLMs possess this capability, we develop the SelfReflect metric, an information-theoretic distance between a given summary and a distribution over answers. In interventional and human studies, we find that SelfReflect detects even slight deviations, yielding a fine-grained measure of faithfulness between a summary string and an LLM's actual internal distribution over answers. With SelfReflect, we make a resounding negative observation: modern LLMs are, across the board, incapable of revealing what they are uncertain about, whether through reasoning, chains of thought, or explicit finetuning. However, we do find that LLMs are able to generate faithful summaries of their uncertainties if we help them by sampling multiple outputs and feeding them back into the context. This simple approach points toward a universal way of communicating LLM uncertainties, whose future development the SelfReflect score enables.
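The sample-and-summarize idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_answer` is a hypothetical callable standing in for one stochastic generation from the LLM, and the prompt wording and default of 10 samples are assumptions.

```python
from collections import Counter
from typing import Callable


def build_summary_prompt(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical: one sampled LLM answer
    n_samples: int = 10,                  # assumed default, not from the paper
) -> str:
    """Sample several answers and place them in a summarization prompt."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1.")
    answers = [sample_answer(question).strip() for _ in range(n_samples)]
    counts = Counter(answers)
    # List each distinct answer with how often it was sampled.
    lines = [
        f"- {answer} (sampled {count} of {n_samples} times)"
        for answer, count in counts.most_common()
    ]
    return (
        f"Question: {question}\n"
        "Here are answers you gave when sampled repeatedly:\n"
        + "\n".join(lines)
        + "\nSummarize every answer you consider possible and how likely each is."
    )
```

The returned prompt would then be passed back to the same model, so that the summary is conditioned on the observed samples rather than on introspection alone.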
