Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models
Zijun Chen, Wenbo Hu, Guande He, Zhijie Deng, Zheng Zhang, Richang Hong
TL;DR
This work analyzes uncertainty calibration in multimodal large language models (MLLMs) and finds that they are miscalibrated overall, with calibration changing little across visual fine-tuning and multimodal training of the base LLMs. It introduces the IDK dataset and an uncertainty analysis framework to study how information from text and images interacts, finding that models are more uncertain about visual content and can improve self-assessment with prompts. The authors propose calibration techniques (temperature scaling and iterative prompt optimization) that substantially enhance reliability across diverse multimodal tasks and datasets, including out-of-distribution scenarios. The findings offer practical guidance for deploying MLLMs responsibly in safety-critical settings and point to avenues for further improving robustness across modalities and prompts.
Abstract
Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. Proper uncertainty calibration is crucial, yet challenging, for reliable use in areas like healthcare and autonomous driving. This paper investigates representative MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning, as well as before and after multimodal training of the base LLMs. We observe that these models are miscalibrated, yet find no significant differences in calibration across these scenarios. We also highlight how uncertainty differs between text and images and how their integration affects overall uncertainty. To better understand MLLMs' miscalibration and their ability to self-assess uncertainty, we construct the IDK (I don't know) dataset, which is key to evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with proper prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications. Code and IDK dataset: https://github.com/hfutml/Calibration-MLLM.
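
For readers unfamiliar with the terms, the following is a minimal, generic sketch of how miscalibration is commonly measured (expected calibration error) and how temperature scaling can reduce it. It is not taken from the authors' repository: the function names, the grid-search fit, and the synthetic logits are all assumptions made for illustration, and the paper's own procedure may differ.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits divided by a scalar temperature."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-weighted gap between accuracy and mean confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    p = softmax(logits, temperature)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL (illustrative choice)."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic stand-in for answer-option logits (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n, k = 2000, 4
labels = rng.integers(0, k, size=n)
base = rng.normal(size=(n, k))
base[np.arange(n), labels] += 1.0
logits = 4.0 * base  # scaling the logits up makes the model overconfident

ece_before = expected_calibration_error(softmax(logits), labels)
T = fit_temperature(logits, labels)
ece_after = expected_calibration_error(softmax(logits, T), labels)
print(f"T={T:.2f}  ECE before={ece_before:.3f}  after={ece_after:.3f}")
```

Temperature scaling leaves the predicted answer unchanged, since dividing logits by a positive scalar preserves their ordering; it only adjusts how confident the model appears. In practice the temperature would be fit on a held-out validation split rather than on the evaluation data, as done here for brevity.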
