An Evaluation of Estimative Uncertainty in Large Language Models

Zhisheng Tang; Ke Shen; Mayank Kejriwal

TL;DR

This article examines how humans and LLMs diverge in interpreting words of estimative probability (WEPs), and what that divergence reveals about the limits of statistical language models in reproducing the subtleties of communication under uncertainty. It also investigates whether GPT-4 can consistently map statistical expressions of uncertainty to appropriate WEPs.

Abstract

Words of estimative probability (WEPs), such as "maybe" or "probably not", are ubiquitous in natural language for communicating estimative uncertainty, and are more common than direct statements involving numerical probability. Human estimative uncertainty, and its calibration with numerical estimates, has long been an area of study, including by intelligence agencies like the CIA. This study compares estimative uncertainty in commonly used large language models (LLMs) like GPT-4 and ERNIE-4 to that of humans, and to each other. Here we show that LLMs like GPT-3.5 and GPT-4 align with human estimates for some, but not all, WEPs presented in English. Divergence is also observed when the LLM is presented with gendered roles and Chinese contexts. Further study shows that an advanced LLM like GPT-4 can consistently map between statistical and estimative uncertainty, but a significant performance gap remains. The results contribute to a growing body of research on human-LLM alignment.

Paper Structure (16 sections, 7 figures, 6 tables)

Main
Results
Discussion
Methods
Data availability statement
Additional information

Figures (7)

Figure 1: Distributions of probabilities (expressed as percentages on the x-axis) on 17 words of estimative probability (WEPs) elicited from six sources: human, GPT-3.5-English, GPT-3.5-Chinese, GPT-4-English, GPT-4-Chinese, and ERNIE-4-Chinese. The graphs on the left feature an x-axis range of 0 to 40 and include 8 WEPs on the y-axis, while the graphs on the right have an x-axis range of 40 to 100 and present the other 9 WEPs on the y-axis. Outliers are omitted from the box-and-whisker plots; where only a dash (-) is shown, the responses have zero variability.
Figure 2: Distributions of probabilities (expressed as percentages on the x-axis) on 17 words of estimative probability (WEPs) elicited from five sources: human, GPT-3.5-Male, GPT-3.5-Female, GPT-4-Male, and GPT-4-Female. The graphs on the left feature an x-axis range of 0 to 40 and include 8 WEPs on the y-axis, while the graphs on the right have an x-axis range of 40 to 100 and present the other 9 WEPs on the y-axis. Outliers are omitted from the box-and-whisker plots; where only a dash (-) is shown, the responses have zero variability.
Figure 3: Distributions of probability estimates on 12 WEPs (divided into three categories: low, moderate, and high probability WEPs) by GPT-3.5 and GPT-4. Each graph shows the estimates given under the Male, Female, and Concise Narrative Context (CNC) settings. The last of these is gender-neutral and serves as a reference. The graphs with low probability words feature an x-axis range of 0 to 40, while the other graphs have an x-axis range of 40 to 100.
Figure 4: A heat map displaying the Kullback-Leibler (KL) divergence between various comparison pairs on 17 words of estimative probability. These pairs are (1) ERNIE-4.0 (prompted in Chinese) and humans, (2) GPT-3.5 or GPT-4 prompted in English and in Chinese, and (3) GPT-3.5 or GPT-4 compared with ERNIE-4.0 (all prompted in Chinese). The intensity of the color within each cell corresponds to the KL divergence value, with darker colors indicating higher divergence. *, **, and *** represent statistical significance for the Mann-Whitney U test at the 90%, 95%, and 99% confidence levels, respectively. Supplementary Information Figures S19-S21 contain the precise Kolmogorov–Smirnov (KS) statistics used to assess the significance of these divergences.
Figure 5: A bar graph illustrating the performance of GPT-4 on answering questions about the outcome of statistically uncertain events using words of estimative probability (WEPs). The graphs compare scores on four metrics: pair-wise consistency, monotonicity consistency, empirical consistency, and empirical monotonicity consistency, for both standard and Chain-of-Thought (CoT) prompting methods. Results for each metric are further divided by scenario. Random performance is shown as a red dashed line for each metric. The standard error is shown as a vertical red line, and the numerical value corresponding to each bar is displayed. *, **, and *** represent statistical significance between standard and CoT prompting, using the paired t-test, at the 90%, 95%, and 99% confidence levels, respectively.
...and 2 more figures
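
The KL divergence shown in the Figure 4 heat map compares two distributions of probability estimates for the same WEP. The following is a minimal sketch of how such a value could be computed from two sets of elicited percentages. The bin width, the additive smoothing constant, and the sample values are assumptions made for illustration; this page does not state how the authors binned or smoothed their data.

```python
import math
from collections import Counter


def histogram(samples, bin_width=5, lo=0, hi=100, alpha=1.0):
    """Bin percentage estimates into fixed-width bins.

    Returns a list of smoothed bin probabilities that sums to 1.
    bin_width and alpha (additive smoothing) are illustrative assumptions.
    """
    n_bins = (hi - lo) // bin_width
    counts = Counter()
    for s in samples:
        # Clamp the upper edge (e.g. 100) into the last bin.
        idx = min(int((s - lo) // bin_width), n_bins - 1)
        counts[idx] += 1
    total = len(samples) + alpha * n_bins
    return [(counts[i] + alpha) / total for i in range(n_bins)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical estimates (in percent) for one WEP from two sources.
human_estimates = [70, 75, 80, 75, 70]
model_estimates = [90, 90, 95, 90, 85]

p = histogram(human_estimates)
q = histogram(model_estimates)
print(f"KL(human || model) = {kl_divergence(p, q):.3f}")
```

The smoothing keeps every bin probability positive, so the logarithm is always defined. Note that KL divergence is asymmetric, so the order of the two sources matters when reading a value from the heat map.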