Improving Metacognition and Uncertainty Communication in Language Models

Mark Steyvers, Catarina Belem, Padhraic Smyth

TL;DR

The paper asks how to improve explicit uncertainty communication in LLMs and whether such metacognitive signals generalize. It fine-tunes GPT-4.1-mini and Llama-3.1-70B on consistency-based uncertainty targets across two tasks (single-question calibration and pairwise comparison) and three domains (MMLU-PRO, GSM8K, TriviaQA), then evaluates calibration ($\text{ECE}$) and discrimination ($\text{AUC}$, including $\text{AUC}_c$ and $\text{AUC}_a$) both within and across domains, including out-of-domain medical and legal data. Supervised fine-tuning improves calibration and discrimination within domains and, to a degree, across domains, with multitask and multidomain training delivering the broadest generalization; transfer between the two metacognitive tasks is limited when they are trained separately but improves under joint training. The results imply that uncertainty communication in LLMs is trainable and that multitask, multidomain supervision is a promising path toward safer, more transparent AI in high-stakes settings like medicine and law, while also highlighting model- and task-specific limits in cross-task transfer.
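
As a rough illustration of the calibration metric named above, the following sketch computes an expected calibration error from stated confidences and correctness labels. The equal-width binning, the default of 10 bins, and the function name `expected_calibration_error` are assumptions made for this example; the summary does not say which binning scheme the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (an assumed variant, not confirmed by the source).

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: 0/1 (or bool) flags, one per answer.
    Returns the bin-size-weighted mean of |accuracy - mean confidence|.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one observation")

    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Four answers, all stated at 0.9 confidence, three of them correct.
example_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In this toy case every answer lands in one bin with accuracy 0.75 and mean confidence 0.9, so the score is the gap between the two. A perfectly calibrated model would score 0.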

Abstract

Large language models (LLMs) are increasingly used in decision-making contexts, but when they present answers without signaling low confidence, users may unknowingly act on erroneous outputs. Prior work shows that LLMs maintain internal uncertainty signals, yet their expressed confidence is often miscalibrated and poorly discriminates between correct and incorrect answers. We investigate whether supervised fine-tuning can improve models' ability to communicate uncertainty and whether such improvements generalize across tasks and domains. We fine-tune LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia, and evaluate two metacognitive tasks: (1) single-question confidence estimation, where the model assigns a numeric certainty to its answer, and (2) pairwise confidence comparison, where the model selects which of two questions it is more likely to answer correctly. We assess generalization to unseen domains, including medical and legal reasoning. Results show that fine-tuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains. However, gains are task-specific: training on single-question calibration does not transfer to pairwise comparison, and vice versa. Multitask fine-tuning yields broader gains, lowering calibration error and strengthening discrimination in out-of-domain evaluations. This suggests that uncertainty communication in LLMs is trainable but requires multitask training to generalize effectively.
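
Discrimination, as described in the abstract, can be read as the probability that a correct answer receives higher stated confidence than an incorrect one. The sketch below computes that quantity as a pairwise (Mann-Whitney style) AUC against answer correctness. The function name `discrimination_auc` and the half-credit treatment of ties are assumptions for illustration, and the $\text{AUC}_c$ and $\text{AUC}_a$ variants are not defined in this summary, so they are not reproduced here.

```python
def discrimination_auc(confidences, correct):
    """AUC of stated confidence as a predictor of answer correctness.

    Equals the fraction of (correct, incorrect) answer pairs in which the
    correct answer has the higher confidence; ties count as half a win
    (an assumed convention).
    """
    positives = [c for c, ok in zip(confidences, correct) if ok]
    negatives = [c for c, ok in zip(confidences, correct) if not ok]
    if not positives or not negatives:
        raise ValueError("need at least one correct and one incorrect answer")

    wins = 0.0
    for pos_conf in positives:
        for neg_conf in negatives:
            if pos_conf > neg_conf:
                wins += 1.0
            elif pos_conf == neg_conf:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Two correct answers (0.9, 0.4) and two incorrect answers (0.6, 0.3).
example_auc = discrimination_auc([0.9, 0.4, 0.6, 0.3], [1, 1, 0, 0])
```

A value of 0.5 means confidence carries no information about correctness, and 1.0 means every correct answer is rated above every incorrect one. The quadratic loop is kept for readability; a rank-based computation would scale better to large test sets.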

Paper Structure

This paper contains 17 sections, 5 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Two metacognitive tasks used to evaluate confidence communication. In the single-question confidence task, the LLM provides a verbalized numeric confidence score along with its answer to a single question. In the pairwise confidence comparison task, the LLM is presented with two questions and first selects the question for which it is more confident. It then provides answers to both questions. In the example for the confidence comparison task, the LLM is more confident about question Q1 and proceeds to provide a correct answer to Q1 and an incorrect answer to Q2. According to the answer correctness scoring method, the confidence comparison answer is considered correct because it discriminates between a correct and an incorrect answer. In the reference consistency scoring method (not illustrated), the confidence comparison answer is considered correct if the model picks the question for which self-consistency derived from sampling is higher. The questions shown here are based on trivia questions from the TriviaQA dataset (Joshi et al., 2017). Note that our methodology spans multiple answer formats (short-answer, multiple-choice, and numeric responses) and covers a range of knowledge domains, including general knowledge, law, and medicine.
  • Figure 2: The LLM fine-tuning procedure illustrated with two example questions from the TriviaQA dataset. For each training question, the LLM is sampled multiple times, and the consistency score and modal answer are computed across samples. For the single-question confidence training set, the calibrated target confidence is computed from the empirical accuracy associated with the consistency score, and the target answer is the modal response. For the pairwise confidence comparison training set, the target answer for each comparison is the question with the higher consistency score. Note that for short-answer questions, computing the modal answer and consistency score involves an additional step of clustering the sampled answers into semantically similar groups.
  • Figure 3: Calibration diagrams for fine-tuned and baseline models for the single-question confidence task using within-domain test questions. Top and bottom rows show the results for verbalized confidences for the GPT model and Llama model respectively. Results show performance on test questions from the MMLU-PRO (left panels), GSM8K (center panels), and TriviaQA (right panels) when the models are fine-tuned on these domains. The shaded regions represent the 95% confidence interval of the mean computed across questions. The histograms at the bottom of each plot show the proportion of observations in each confidence bin (values are scaled by 30% for visual clarity).
  • Figure 4: Calibration diagrams for fine-tuned and baseline models for the single-question confidence task using out-of-domain questions. Top and bottom rows show results for verbalized confidences for GPT-4.1-mini and Llama-3.1-70B respectively. The fine-tuned models are based on the combined training data of MMLU-PRO, GSM8K and TriviaQA (M+G+T). Models are tested on new domains: TruthfulQA (left panels), MetaMedQA (center panels), and LegalBench (right panels). The shaded regions represent the 95% confidence interval of the mean computed across questions. The histograms at the bottom of each plot show the proportion of observations in each confidence bin (values are scaled by 30% for visual clarity).
  • Figure A1: Calibration results for consistency scores across task domains and prompting styles for GPT-4.1-mini. Results are based on the full data set before subsampling (top row) and the test set after subsampling (bottom row). The histograms at the bottom of each plot show the proportion of observations in each confidence bin (values are scaled by 30% for visual clarity).
  • ...and 3 more figures
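
The target construction described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: exact string matching stands in for the semantic clustering step used for short-answer questions, consistency scores are grouped by exact value rather than binned, ties in the pairwise comparison are skipped, and all function names are hypothetical.

```python
from collections import Counter


def consistency(samples):
    """Return (modal_answer, consistency_score) for one question's samples.

    Assumption: answers are compared by exact string match; the caption
    notes that short-answer questions need semantic clustering instead.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    modal_answer, count = Counter(samples).most_common(1)[0]
    return modal_answer, count / len(samples)


def calibrated_targets(records):
    """Map each consistency score to the empirical accuracy observed at it.

    records: iterable of (consistency_score, modal_answer_is_correct).
    Assumption: scores are grouped by exact value, with no binning.
    """
    grouped = {}
    for score, is_correct in records:
        grouped.setdefault(score, []).append(1 if is_correct else 0)
    return {score: sum(flags) / len(flags) for score, flags in grouped.items()}


def pairwise_target(samples_q1, samples_q2):
    """Return 'Q1' or 'Q2' for the question with higher consistency.

    Returns None on a tie (a hypothetical choice: such pairs are skipped).
    """
    _, score_q1 = consistency(samples_q1)
    _, score_q2 = consistency(samples_q2)
    if score_q1 == score_q2:
        return None
    return "Q1" if score_q1 > score_q2 else "Q2"


# Made-up sampled answers for two questions, four samples each.
q1_samples = ["Paris", "Paris", "Lyon", "Paris"]
q2_samples = ["1912", "1915", "1912", "1921"]
modal, score = consistency(q1_samples)
target = pairwise_target(q1_samples, q2_samples)
```

Here the first question has a consistency score of 0.75 and the second 0.5, so the pairwise target is the first question. The single-question target confidence would then be looked up in the mapping returned by `calibrated_targets`, built from training questions whose modal answers have been scored for correctness.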