Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models

Zijun Chen, Wenbo Hu, Guande He, Zhijie Deng, Zheng Zhang, Richang Hong

TL;DR

This work analyzes uncertainty calibration in multimodal large language models (MLLMs) and finds that they remain miscalibrated overall, with no significant change in calibration across visual fine-tuning or multimodal training of the base LLMs. It introduces the IDK dataset and an uncertainty analysis framework to study how information from text and images interacts, finding that models are more uncertain about visual content and that their self-assessment improves with suitable prompts. The authors propose two calibration techniques, temperature scaling and iterative prompt optimization, that substantially enhance reliability across diverse multimodal tasks and datasets, including out-of-distribution scenarios. The findings offer practical guidance for deploying MLLMs responsibly in safety-critical settings and point to avenues for further improving robustness across modalities and prompts.

Abstract

Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for reliable use in areas like healthcare and autonomous driving. This paper investigates representative MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning, as well as before and after multimodal training of the base LLMs. We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios. We also highlight how uncertainty differs between text and images and how their integration affects overall uncertainty. To better understand MLLMs' miscalibration and their ability to self-assess uncertainty, we construct the IDK (I don't know) dataset, which is key to evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with proper prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications. Code and IDK dataset: https://github.com/hfutml/Calibration-MLLM.
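As a rough illustration of the calibration ideas named in the abstract, the sketch below computes a binned expected calibration error (ECE) and fits a single softmax temperature on held-out logits. It is a minimal sketch and not the authors' implementation: the function names, the 10-bin ECE, and the grid search over temperatures are all assumptions made for illustration, and the paper's repository may do this differently.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over one row of logits, divided by a temperature first."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| per bin.

    The equal-width binning and n_bins=10 are illustrative assumptions.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(avg_conf - acc)
    return ece

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature minimizing held-out NLL (grid search is an assumption)."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # 0.5 through 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical toy data: four confident predictions, one of them wrong.
rows = [[4.0, 0.0]] * 4
labels = [0, 0, 0, 1]
T = fit_temperature(rows, labels)
```

Because dividing logits by a positive temperature does not change which option has the highest score, this kind of post-hoc scaling leaves accuracy unchanged and only adjusts the reported confidence.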

Paper Structure

This paper contains 16 sections, 2 equations, 10 figures, 14 tables, 1 algorithm.

Figures (10)

  • Figure 1: Images are replaced with text descriptions generated by GPT-4V, which describe the images accurately.
  • Figure 2: Logits-based likelihood is used to quantify model uncertainty, where higher confidence means lower uncertainty. Stage 1 refers to the pre-trained MLLMs, while Stage 2 follows visual fine-tuning.
  • Figure 3: The change in uncertainty for images with different levels of noise as the amount of text description increases. Noise=0 means no noise is added. $\text{NoisyImage} = \text{Image} + \mathcal{N}(0, \text{Noise})$
  • Figure 4: Changes in ECE after calibration for different models tested on MMBench.
  • Figure 5: Text descriptions are gradually added to images with different levels of noise to observe how model uncertainty changes as information from the two modalities is integrated.
  • ...and 5 more figures
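
The noise model in the Figure 3 caption, NoisyImage = Image + N(0, Noise), can be sketched as follows. This is a minimal sketch under stated assumptions: the image is a hypothetical nested list of float pixel values, and `noise` is treated as the standard deviation of the Gaussian, which the caption does not specify (it could equally denote the variance).

```python
import random

def add_gaussian_noise(image, noise, seed=None):
    """Return a copy of `image` with zero-mean Gaussian noise added per pixel.

    `image` is a list of rows of floats (a hypothetical representation).
    `noise` is used as the standard deviation; noise=0 leaves pixels unchanged.
    """
    rng = random.Random(seed)  # local generator so results are reproducible per seed
    return [[pixel + rng.gauss(0.0, noise) for pixel in row] for row in image]

# Hypothetical 2x2 grayscale image.
image = [[0.0, 0.5], [1.0, 0.25]]
clean = add_gaussian_noise(image, 0.0)
noisy = add_gaussian_noise(image, 0.1, seed=0)
```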