Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models

Mirko Borszukovszki, Ivo Pascal de Jong, Matias Valdenegro-Toro

TL;DR

The study probes how well vision-language models (VLMs) calibrate their verbalized uncertainty when processing corrupted images in visual question answering (VQA) and counting tasks. Models are prompted to state a confidence or a confidence interval for images corrupted with Gaussian noise, defocus blur, or JPEG compression. The results show persistent overconfidence and rising calibration error as corruption severity increases. GPT-4V generally offers the best calibration among the tested models, and higher refusal rates can improve calibration in some scenarios. The work highlights the gap between the confidence VLMs express and the reliability of their answers, and points to prompting strategies, selective abstention, and broader corruption testing as directions for safer, more reliable deployment.

Abstract

To leverage the full potential of Large Language Models (LLMs), it is crucial to have some information on the uncertainty of their answers. This means that the model has to be able to quantify how certain it is that a given response is correct. Bad uncertainty estimates can lead to overconfident wrong answers, undermining trust in these models. A lot of research has been done on language models that take text inputs and produce text outputs. However, since visual capabilities were added to these models only recently, there has not been much progress on the uncertainty of Visual Language Models (VLMs). We tested three state-of-the-art VLMs on corrupted image data. We found that the severity of the corruption negatively impacted the models' ability to estimate their uncertainty, and the models also showed overconfidence in most of the experiments.
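
To make "calibration" concrete, the sketch below computes the expected calibration error (ECE), a common way to compare stated confidence against observed accuracy. This overview does not say which metric or binning the paper uses, so the function name `expected_calibration_error`, the equal-width binning, and the default of 10 bins are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, not taken from the paper
) -> float:
    """Illustrative ECE with equal-width confidence bins on [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one answer is required")

    # Group (confidence, correctness) pairs by confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    # Weighted average of |mean confidence - accuracy| over non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Hypothetical example: four answers, each stated with 90% confidence,
# of which only one is correct (an overconfident model).
example_ece = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
```

In this made-up example the mean confidence is 0.9 and the accuracy is 0.25, so the ECE is about 0.65. A well-calibrated model would have a value close to 0.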

Paper Structure

This paper contains 21 sections, 1 equation, 23 figures, 6 tables.

Figures (23)

  • Figure 1: Question: What is on the sheep? With small noise, GPT-4V is confidently incorrect.
  • Figure 2: Sample answer from Claude with Defocus Blur Corruption. Question: Where was this photo taken? Correct Answer: Japan, Kyoto, Arashiyama Area, the Bridge is named Togetsu-kyo Bridge (or Toei Bridge). It is clear how answers and confidence degrade with increasing corruption severity. Full answers in Table \ref{tab:claude_answers}.
  • Figure 3: Demonstration of the used corruptions on severity 5. Question: What kind of food is showcased in this photo? Answer: Japanese food. Also acceptable is that it is a food model, called Shokuhin Sampuru in Japanese.
  • Figure 4: Samples from the three tasks. (a) represents the "easy" task, (b) the "hard" task, (c) the "counting" task.
  • Figure 5: Accuracy and confidence plots for the three examined models and the three corruptions in the easy visual question answering experiment.
  • ...and 18 more figures