Improving Metacognition and Uncertainty Communication in Language Models
Mark Steyvers, Catarina Belem, Padhraic Smyth
TL;DR
The paper asks how to improve explicit uncertainty communication in LLMs and whether such metacognitive signals generalize. It fine-tunes GPT-4.1-mini and Llama-3.1-70B on consistency-based uncertainty targets across two tasks (single-question calibration and pairwise comparison) and three domains (MMLU-PRO, GSM8K, TriviaQA), then evaluates calibration ($\text{ECE}$) and discrimination ($\text{AUC}$, including $\text{AUC}_c$ and $\text{AUC}_a$) both within and across domains, including out-of-domain medical and legal data. Supervised fine-tuning improves calibration and discrimination within domains and, to a lesser degree, across domains, with multitask and multidomain training delivering the broadest generalization. Transfer between the two metacognitive tasks is limited when each is trained separately but improves under joint training. The results imply that uncertainty communication in LLMs is trainable and that multitask, multidomain supervision is a promising path toward safer, more transparent AI in high-stakes settings such as medicine and law, while also highlighting model- and task-specific limits in cross-task transfer.
Abstract
Large language models (LLMs) are increasingly used in decision-making contexts, but when they present answers without signaling low confidence, users may unknowingly act on erroneous outputs. Prior work shows that LLMs maintain internal uncertainty signals, yet their expressed confidence is often miscalibrated and poorly discriminates between correct and incorrect answers. We investigate whether supervised fine-tuning can improve models' ability to communicate uncertainty and whether such improvements generalize across tasks and domains. We fine-tune LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia, and evaluate two metacognitive tasks: (1) single-question confidence estimation, where the model assigns a numeric certainty to its answer, and (2) pairwise confidence comparison, where the model selects which of two questions it is more likely to answer correctly. We assess generalization to unseen domains, including medical and legal reasoning. Results show that fine-tuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains. However, gains are task-specific: training on single-question calibration does not transfer to pairwise comparison, and vice versa. Multitask fine-tuning yields broader gains, lowering calibration error and strengthening discrimination in out-of-domain evaluations. This suggests that uncertainty communication in LLMs is trainable but requires multitask training to generalize effectively.
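
To make the two evaluation notions concrete, the following is a minimal sketch of how calibration and discrimination are commonly scored from a list of stated confidences and correctness labels. It is illustrative only and is not taken from the paper: the function names, the choice of 10 equal-width bins for ECE, and the pairwise (Mann-Whitney) form of AUC are assumptions, and the sketch does not distinguish the $\text{AUC}_c$ and $\text{AUC}_a$ variants mentioned in the TL;DR.

```python
from typing import Sequence


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Binned ECE: weighted mean of |mean confidence - accuracy| per bin.

    Assumes confidences lie in [0, 1]; the equal-width binning and the
    default of 10 bins are illustrative choices, not the paper's settings.
    """
    if len(conf) != len(correct) or not conf:
        raise ValueError("conf and correct must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    total = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def discrimination_auc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """AUC as the probability that a correct answer receives higher
    confidence than an incorrect one, counting ties as one half."""
    pos = [c for c, y in zip(conf, correct) if y]
    neg = [c for c, y in zip(conf, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: four answers with stated confidences.
    confidences = [0.9, 0.8, 0.6, 0.3]
    is_correct = [1, 1, 0, 0]
    print("ECE:", expected_calibration_error(confidences, is_correct))
    print("AUC:", discrimination_auc(confidences, is_correct))
```

A lower ECE indicates that stated confidence tracks accuracy more closely, while an AUC near 1 indicates that correct answers are consistently given higher confidence than incorrect ones. The two can diverge: a model may rank its answers well yet be systematically overconfident, which is why the abstract reports both.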
