What Large Language Models Know and What People Think They Know

Mark Steyvers; Heliodoro Tejeda; Aakriti Kumar; Catarina Belem; Sheer Karny; Xinyue Hu; Lukas Mayer; Padhraic Smyth

TL;DR

This work investigates how well users understand LLM uncertainty and whether explanations can bridge the gap between what LLMs know and what people think they know. By extracting model confidence from token likelihoods on multiple-choice questions (MMLU) and short-answer questions (Trivia QA), the authors quantify a calibration gap and a discrimination gap in human judgments. They show that default explanations inflate user confidence and that longer explanations magnify this effect, while prompting strategies that align the uncertainty expressed in an explanation with the model's internal confidence substantially narrow both gaps, as measured by expected calibration error ($ECE$) and area under the curve ($AUC$). The findings highlight the importance of truthful uncertainty communication and demonstrate that careful design of explanation styles can improve trust and decision-making in AI-assisted contexts, with practical implications for deploying LLMs in high-stakes settings.

Abstract

As artificial intelligence (AI) systems, particularly large language models (LLMs), become increasingly integrated into decision-making processes, the ability to trust their outputs is crucial. To earn human trust, LLMs must be well calibrated such that they can accurately assess and communicate the likelihood of their predictions being correct. Whereas recent work has focused on LLMs' internal confidence, less is understood about how effectively they convey uncertainty to users. Here we explore the calibration gap, which refers to the difference between human confidence in LLM-generated answers and the models' actual confidence, and the discrimination gap, which reflects how well humans and models can distinguish between correct and incorrect answers. Our experiments with multiple-choice and short-answer questions reveal that users tend to overestimate the accuracy of LLM responses when provided with default explanations. Moreover, longer explanations increased user confidence, even when the extra length did not improve answer accuracy. By adjusting LLM explanations to better reflect the models' internal confidence, both the calibration gap and the discrimination gap narrowed, significantly improving user perception of LLM accuracy. These findings underscore the importance of accurate uncertainty communication and highlight the effect of explanation length in influencing user trust in AI-assisted decision-making environments. Code and data are available at https://osf.io/y7pr6/. The journal version is published in Nature Machine Intelligence: https://www.nature.com/articles/s42256-024-00976-7
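The two quantities named above can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' implementation: it computes expected calibration error (ECE) with equal-width bins and AUC in its rank-based (Mann-Whitney) form, then compares model confidence against human confidence on the same answers. The function names, the choice of 10 bins, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean |accuracy - confidence| per bin."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so every value in [0, 1] maps to a bin in 0..n_bins-1.
    bin_ids = np.digitize(confidence, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the share of items in the bin
    return float(ece)

def auc(confidence, correct):
    """Probability that a correct answer gets higher confidence than an
    incorrect one, counting ties as one half."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = confidence[correct], confidence[~correct]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs both correct and incorrect answers")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy data (invented): four answers, the first two correct.
correct = [1, 1, 0, 0]
model_conf = [0.95, 0.85, 0.35, 0.65]   # hypothetical model confidence
human_conf = [0.95, 0.90, 0.90, 0.95]   # hypothetical human confidence in the model

calibration_gap = (expected_calibration_error(human_conf, correct)
                   - expected_calibration_error(model_conf, correct))
discrimination_gap = auc(model_conf, correct) - auc(human_conf, correct)
```

In this toy case the human ratings are uniformly high regardless of correctness, so human ECE exceeds model ECE and human AUC falls below model AUC, which is the pattern the abstract describes for default explanations.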

Paper Structure (41 sections, 5 equations, 8 figures, 6 tables)

Introduction
Large Language Models
Methodology
Question data sets
MMLU dataset for multiple choice questions.
Trivia QA dataset for short answer questions.
Assessing model confidence and creating question subsets
Multiple choice questions.
Short-answer questions.
Behavioral Experiments
Participants
Experimental Procedure
Creating explanation styles with varying degrees of uncertainty
Experiment 1: baseline explanations.
Experiment 2: modified explanations.
...and 26 more sections

Figures (8)

Figure 1: Overview of the evaluation methodology for assessing the calibration gap between model confidence and human confidence in the model. For multiple-choice questions (top), the approach works as follows: (1) prompt the LLM with a question to obtain the model's internal confidence for each answer choice; (2) select the most likely answer and prompt the model a second time to generate an explanation for the given answer; (3) obtain the human confidence by showing users the question and the LLM's explanation and asking users to indicate the probability that the model is correct. In this toy example the model confidence for the multiple-choice question is 0.46 for answer C, whereas the human confidence is 0.95. For short-answer questions (bottom), the approach is similar except that internal model confidence is obtained by an additional step where the LLM is prompted to evaluate whether the previously provided answer to the question is true or false (Kadavath et al., 2022). In the short-answer question example, the LLM explanation was modified with uncertainty language to convey the low model confidence (0.18). For the two toy examples, the correct answers are "A" and "blue bird".
Figure 2: Calibration error and discrimination for model confidence and human confidence across the behavioral experiments and LLMs. Calibration error is assessed by ECE (lower is better) while discrimination is assessed by AUC (higher is better). Vertical dashed lines represent the calibration and discrimination gaps between model confidence and human confidence for unmodified explanations (Experiments 1a, 1b, and 1c). For human confidence, data points represent the AUC values computed separately for each participant ($n$ shown in figure), and error bars represent the 95% confidence interval of the mean across participants. Because of data sparsity, the ECE values were computed at the group level.
Figure 3: Calibration diagrams for model confidence and human confidence across Experiments 1 and 2. The top and middle rows show results for multiple-choice questions with the GPT-3.5 and PaLM2 models, respectively. The bottom row shows results for short-answer questions with the GPT-4o model. The histograms at the bottom of each plot show the proportion of observations in each confidence bin (values are scaled by 30% for visual clarity). Shaded regions represent the 95% confidence interval of the mean computed across participants and questions.
Figure 4: Mean human confidence across LLM explanation styles varying in uncertainty language and length. Data are presented as mean values of participant confidence in Experiments 2a ($n$=60), 2b ($n$=60), and 2c ($n$=59). For reference, dashed lines show the average human confidence for the baseline explanations in Experiment 1a, 1b, and 1c. Error bars represent the 95% confidence interval of the mean across participants.
Figure 5: Calibration diagrams for the full set of MMLU questions for GPT-3.5 and PaLM2 (left and middle panels) and the 5,000-question sample of the Trivia QA dataset using the GPT-4o model (right panel).
...and 3 more figures
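Step (1) of the Figure 1 caption, obtaining a confidence for each answer choice and selecting the most likely one, can be sketched as follows. This is a minimal illustration under stated assumptions: it supposes that a log-probability is available for each answer-choice token and that confidence is obtained by renormalizing over the listed choices. The source does not spell out this exact computation here, and the function name and toy values are hypothetical.

```python
import math

def choice_confidence(logprobs):
    """Renormalize answer-choice log-probabilities and pick the top choice.

    `logprobs` maps a choice label (e.g. "A".."D") to the log-probability
    the model assigned to that label's token (assumed to be available).
    """
    top = max(logprobs.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {label: math.exp(lp - top) for label, lp in logprobs.items()}
    total = sum(weights.values())
    probs = {label: w / total for label, w in weights.items()}
    answer = max(probs, key=probs.get)
    return answer, probs[answer], probs

# Toy log-probabilities (invented); they need not sum to 1 before renormalizing.
toy = {"A": math.log(0.2), "B": math.log(0.1), "C": math.log(0.4), "D": math.log(0.1)}
answer, confidence, probs = choice_confidence(toy)
```

The selected answer would then be passed to a second prompt that asks for an explanation, and `confidence` plays the role of the model confidence that human ratings are compared against.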
