Uncertainty-Aware Evaluation for Vision-Language Models

Vasily Kostumov, Bulat Nutfullin, Oleg Pilipenko, Eugene Ilyushin

TL;DR

This paper addresses a gap in the evaluation of vision-language models (VLMs) by incorporating uncertainty quantification through conformal prediction (CP). The authors construct a CP-based benchmark across five multiple-choice VQA datasets and analyze 20+ VLMs, using two score functions (LAC and APS) and calibration metrics, with the error rate alpha set to 0.1, which targets 90% coverage. They show that accuracy and uncertainty are not aligned, observe varied effects of increasing LLM size and chat finetuning on uncertainty, and highlight the practical value of joint accuracy-uncertainty evaluation for safer and more reliable VLM deployment. The work provides a scalable, model-agnostic uncertainty framework that complements traditional accuracy-focused benchmarks and points to future work on broader vision-language tasks and robustness concerns.
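To make the conformal prediction setup concrete, here is a minimal sketch of split conformal prediction for multiple-choice answers with the two score functions named above. It is an illustration under stated assumptions, not the authors' implementation: the function names, the non-randomized form of APS, and the toy probabilities are all hypothetical, and the model's per-option probabilities are assumed to be available already.

```python
import math

def lac_score(probs, label):
    """LAC nonconformity score: one minus the probability of the given option."""
    return 1.0 - probs[label]

def aps_score(probs, label):
    """APS nonconformity score (non-randomized variant, an assumption here):
    total probability of every option ranked at least as high as the given one."""
    return sum(p for p in probs if p >= probs[label])

def conformal_threshold(scores, alpha=0.1):
    """Finite-sample corrected (1 - alpha) quantile of the calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank in the sorted scores
    if rank > n:
        # Too few calibration points for this alpha: every option is kept.
        return math.inf
    return sorted(scores)[rank - 1]

def prediction_set(probs, threshold, score_fn):
    """All answer options whose nonconformity score does not exceed the threshold."""
    return [y for y in range(len(probs)) if score_fn(probs, y) <= threshold]

# Toy calibration split: per-option probabilities and true option indices.
# These numbers are made up for illustration only.
cal_probs = [
    [0.70, 0.20, 0.05, 0.05], [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.40, 0.10], [0.05, 0.05, 0.10, 0.80],
    [0.50, 0.30, 0.10, 0.10], [0.30, 0.40, 0.20, 0.10],
    [0.15, 0.15, 0.60, 0.10], [0.45, 0.35, 0.10, 0.10],
    [0.20, 0.20, 0.20, 0.40], [0.55, 0.25, 0.10, 0.10],
]
cal_labels = [0, 1, 2, 3, 1, 1, 2, 0, 3, 0]

for name, score_fn in [("LAC", lac_score), ("APS", aps_score)]:
    cal_scores = [score_fn(p, y) for p, y in zip(cal_probs, cal_labels)]
    q_hat = conformal_threshold(cal_scores, alpha=0.1)
    test_probs = [0.40, 0.35, 0.20, 0.05]  # hypothetical test question
    answer_set = prediction_set(test_probs, q_hat, score_fn)
    print(name, "threshold:", q_hat, "set:", answer_set, "size:", len(answer_set))
```

In this kind of setup, the size of the prediction set serves as the uncertainty signal: a larger set means the model cannot rule out as many options at the chosen coverage level, even if its top answer happens to be correct.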

Abstract

Vision-Language Models (VLMs) like GPT-4, LLaVA, and CogVLM have surged in popularity recently due to their impressive performance in several vision-language tasks. Current evaluation methods, however, overlook an essential component: uncertainty, which is crucial for a comprehensive assessment of VLMs. Addressing this oversight, we present a benchmark incorporating uncertainty quantification into evaluating VLMs. Our analysis spans 20+ VLMs, focusing on the multiple-choice Visual Question Answering (VQA) task. We examine models on 5 datasets that evaluate various vision-language capabilities. Using conformal prediction as an uncertainty estimation approach, we demonstrate that the models' uncertainty is not aligned with their accuracy. Specifically, we show that models with the highest accuracy may also have the highest uncertainty, which confirms the importance of measuring it for VLMs. Our empirical findings also reveal a correlation between a model's uncertainty and its language model component.

Paper Structure (29 sections, 8 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Several VLMs predict correct answers while demonstrating different levels of certainty. In case of an incorrect answer, the models' confidence may also vary.
  • Figure 2: Timeline of MultiModal Large Language Models (inspired by zhang2024mmllms)
  • Figure 3: Comparison of LLaVA1.6 with LLMs of different sizes.
  • Figure 4: Comparison of Qwen-VL and Qwen-VL-Chat.
  • Figure 5: Comparison of MobileVLMV2 with LLMs of different sizes.
  • ...and 3 more figures