Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models

Tobias Groot; Matias Valdenegro-Toro

TL;DR

The paper investigates verbalized uncertainty estimation in large language and vision-language models, revealing widespread miscalibration and overconfidence across NLP and image-recognition tasks. By introducing the Japanese Uncertain Scenes dataset (JUS) and the Net Calibration Error ($NCE$), it extends calibration analysis beyond traditional metrics like $ECE$ and $MCE$ to capture the direction of miscalibration. Results show that both LLMs and VLMs struggle to reliably express uncertainty, with GPT-4 offering the best calibration among LLMs and GPT-4V outperforming Gemini Pro Vision among VLMs yet still underperforming in uncertainty estimation. The work highlights the need for improved prompting strategies and model architectures to reliably quantify and convey uncertainty in AI outputs, which is crucial for safe deployment in real-world tasks.
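
The relationship between the three metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: this summary does not give the formulas, so the sketch assumes the common equal-width binning for $ECE$ and $MCE$, and assumes $NCE$ is the signed counterpart of $ECE$ (bin-weighted accuracy minus confidence, so negative values mean overconfidence). The function name `calibration_errors` and the sample values are hypothetical.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ece, mce, nce) using equal-width confidence bins.

    Assumption: NCE is the bin-weighted mean of (accuracy - confidence),
    i.e. ECE without the absolute value. Negative means overconfident.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("need two non-empty sequences of equal length")

    # Group (confidence, correctness) pairs into equal-width bins on [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = mce = nce = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        gap = accuracy - avg_conf          # signed gap for this bin
        weight = len(members) / n          # share of samples in this bin
        ece += weight * abs(gap)           # magnitude only
        mce = max(mce, abs(gap))           # worst single bin
        nce += weight * gap                # keeps the direction


    return ece, mce, nce


# Made-up example: the model states 90% confidence but is right 1 time in 4.
ece, mce, nce = calibration_errors([0.9, 0.9, 0.9, 0.9], [True, False, False, False])
print(f"ECE={ece:.2f}  MCE={mce:.2f}  NCE={nce:.2f}")
```

Under this assumed definition, opposite-signed gaps in different bins can cancel, so an $NCE$ near zero does not by itself indicate good calibration; it is meant to be read alongside $ECE$.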

Abstract

Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI through their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper evaluates the ability of LLMs (GPT-4, GPT-3.5, LLaMA2, and PaLM 2) and VLMs (GPT-4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE) to measure the direction of miscalibration. Results show that both LLMs and VLMs have a high calibration error and are overconfident most of the time, indicating a poor capability for uncertainty estimation. Additionally, we develop prompts for regression tasks, and we show that VLMs have poor calibration when producing mean/standard deviation and 95% confidence intervals.
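
One simple way to check verbalized 95% confidence intervals on a counting task is empirical coverage: the fraction of ground-truth values that fall inside the stated interval, which should be close to 0.95 for a well-calibrated model. The sketch below is an illustration under that assumption and is not necessarily the evaluation used in the paper; the function name `interval_coverage` and all numbers are made up.

```python
def interval_coverage(intervals, truths):
    """Fraction of true values that fall inside their stated (low, high) interval."""
    if not intervals or len(intervals) != len(truths):
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(1 for (low, high), y in zip(intervals, truths) if low <= y <= high)
    return hits / len(truths)


# Hypothetical model answers for three object-counting prompts.
stated_intervals = [(10, 20), (5, 8), (100, 150)]
true_counts = [15, 12, 120]

coverage = interval_coverage(stated_intervals, true_counts)
print(f"coverage = {coverage:.2f} (a calibrated 95% interval would give about 0.95)")
```

Coverage well below 0.95 means the stated intervals are too narrow, which is the overconfident behavior described above.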

Paper Structure (27 sections, 3 equations, 19 figures, 21 tables)

Introduction
Related Work
Evaluation Approach
Models and Tasks
Datasets
Japanese Uncertain Scenes Image Dataset
Data Gathering
Calibration Errors
Experimental Results
Large Language Models
Sentiment Analysis
Math Word Problems
Named-Entity Recognition
Vision Language Models
Image Recognition on JUS
...and 12 more sections

Figures (19)

Figure 1: Example prompt results for GPT-4V and Gemini Pro Vision on a JUS Prompt \ref{tab:dataset3}, where a 95% confidence interval is requested but the correct answer is outside the confidence interval. This shows that VLMs also have problems with verbalized uncertainty and provide overconfident answers. GPT-4V is closer to the correct answer. The full prompt is provided in Sec. \ref{sup:prompt_eng}. Photo taken at the Tōrō-Nagashi (Floating Lantern Ceremony) on August 6 in Hiroshima, Japan.
Figure 2: Example answers from GPT-4V and Gemini Pro Vision for the image recognition task on three JUS image-prompts. Columns 1 and 3 are incorrect overconfident answers, and Column 2 is underconfident correct. These results show how VLMs produce incorrect verbalized uncertainty.
Figure 3: Synthetic calibration plots demonstrating the interpretation of NCE. All bin sizes are equal. Note how ECE does not indicate the direction of miscalibration (overconfident or underconfident), while NCE does.
Figure 4: Calibration plots and confidence histograms for the sentiment analysis task with binary labels. GPT-3.5 shows closer calibration to the ideal, whereas the other models mostly exhibit underconfidence.
Figure 5: Calibration plots and confidence histograms for the math word problems task. All models exhibit excessive overconfidence except for GPT-4, and all models output extremely high confidence in their answers.
...and 14 more figures
