Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation

Yunpu Zhao, Rui Zhang, Junbin Xiao, Ruibo Hou, Jiaming Guo, Zihao Zhang, Yifan Hao, Yunji Chen

TL;DR

This work proposes a novel Confidence Calibration through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for VLMs in response to object-centric queries and significantly improves the alignment between verbalized confidence and response correctness.

Abstract

Vision-language models (VLMs) excel in various multimodal tasks but frequently suffer from poor calibration, resulting in misalignment between their verbalized confidence and response correctness. This miscalibration undermines user trust, especially when models confidently provide incorrect or fabricated information. In this work, we propose a novel Confidence Calibration through Semantic Perturbation (CSP) framework to improve the calibration of verbalized confidence for VLMs in response to object-centric queries. We first introduce a perturbed dataset where Gaussian noise is applied to the key object regions to simulate visual uncertainty at different confidence levels, establishing an explicit mapping between visual ambiguity and confidence levels. We further enhance calibration through a two-stage training process combining supervised fine-tuning on the perturbed dataset with subsequent preference optimization. Extensive experiments on popular benchmarks demonstrate that our method significantly improves the alignment between verbalized confidence and response correctness while maintaining or enhancing overall task performance. These results highlight the potential of semantic perturbation as a practical tool for improving the reliability and interpretability of VLMs.
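The perturbation idea in the abstract can be illustrated with a minimal sketch. It adds Gaussian noise only inside a key-object mask, with stronger noise standing for a lower target confidence. The function name `perturb_object_region`, the `max_sigma` parameter, and the linear mapping from confidence to noise scale are assumptions made for illustration; the abstract does not specify the noise schedule, and in the paper the object mask comes from GroundingDINO and SAM rather than being supplied by hand.

```python
import numpy as np


def perturb_object_region(image, mask, confidence, max_sigma=1.0, rng=None):
    """Add Gaussian noise to the masked object region of an image.

    image:      float array of shape (H, W, C) with values in [0, 1]
    mask:       boolean array of shape (H, W), True on the key object
    confidence: target confidence label in [0, 1]; lower means more noise
    max_sigma:  noise std used at confidence 0 (hypothetical parameter)
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng

    # Assumed linear schedule: confidence 1.0 -> no noise, 0.0 -> max_sigma.
    sigma = (1.0 - confidence) * max_sigma

    noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)
    noisy = np.clip(image + noise, 0.0, 1.0)

    # Only pixels inside the object mask are replaced; the background is kept.
    out = image.copy()
    out[mask] = noisy[mask]
    return out


if __name__ == "__main__":
    img = np.full((4, 4, 3), 0.5)
    obj = np.zeros((4, 4), dtype=bool)
    obj[1:3, 1:3] = True
    low_conf = perturb_object_region(img, obj, confidence=0.2,
                                     rng=np.random.default_rng(0))
    print("background unchanged:", np.array_equal(low_conf[~obj], img[~obj]))
```

Each perturbed image would then be paired with its confidence label to form a training example, which is how the explicit mapping between visual ambiguity and confidence level arises.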

Paper Structure

This paper contains 37 sections, 6 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: (Top) Most current VLMs tend to express high verbalized confidence in incorrect responses. (Bottom) After calibration, the model's verbalized confidence aligns with response correctness.
  • Figure 2: Dataset construction and training pipeline for improving confidence calibration in VLMs, shown as a two-stage process. Dataset construction: key object regions are extracted using GroundingDINO and SAM, semantic perturbations are applied, and confidence labels are assigned based on noise levels. Training pipeline: the VLM is fine-tuned with supervised learning, followed by preference optimization, to improve probability-confidence alignment and response calibration.
  • Figure 3: ROC curves (top row) and probability calibration plots (bottom row) on the AMBER attribute dataset, comparing model performance before and after applying our proposed confidence calibration method. The ROC curves show improved true positive rates (higher AUC values) after training, while the probability calibration plots indicate better alignment between predicted confidence and correctness (lower Brier scores).
  • Figure 4: Ablation results for different variants of our method on the POPE adversarial setting for the Qwen2 model.
  • Figure 5: Comparison of Accuracy, Precision, Recall, and F1 Score across different model configurations.
  • ...and 5 more figures
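
The Brier score mentioned in the Figure 3 caption is the mean squared difference between stated confidence and binary correctness, so lower is better. The following is a minimal sketch of that standard definition; the function name `brier_score` and the toy inputs are illustrative and are not taken from the paper.

```python
def brier_score(confidences, correct):
    """Mean squared error between confidence in [0, 1] and 0/1 correctness."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


if __name__ == "__main__":
    outcomes = [1, 0, 1, 0]  # 1 = response was correct, 0 = incorrect

    # A model that always claims 0.9 confidence, even when it is wrong.
    overconfident = [0.9, 0.9, 0.9, 0.9]
    # A model whose confidence tracks correctness.
    calibrated = [0.9, 0.1, 0.9, 0.1]

    print(f"overconfident: {brier_score(overconfident, outcomes):.2f}")
    print(f"calibrated:    {brier_score(calibrated, outcomes):.2f}")
```

In this toy case the overconfident model scores about 0.41 and the calibrated one about 0.01, which is the kind of gap the before-and-after calibration plots are meant to expose.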