Toward More Reliable Artificial Intelligence: Reducing Hallucinations in Vision-Language Models

Kassoum Sanogo, Renzo Ardiccioni

TL;DR

This paper tackles hallucinations in vision-language models by introducing a training-free self-correction framework that uses uncertainty-guided visual re-attention. It unifies four uncertainty signals (token entropy, attention dispersion, semantic consistency, and claim hedging) into a single score to identify potentially false claims, then uses targeted multi-scale visual crops and focused verification questions to iteratively refine responses. Empirical results on POPE and MMHAL-BENCH with Qwen2.5-VL-7B show a gain of 4.7 percentage points in adversarial POPE accuracy and a reduction of 9.8 percentage points in hallucination rate on MMHAL-BENCH, with most of the improvement arising in early iterations. The work emphasizes a training-free deployment path for improving reliability in multimodal systems, discusses computational trade-offs, and outlines future directions such as cross-architecture validation and external knowledge integration.

Abstract

Vision-language models (VLMs) frequently generate hallucinated content: plausible but incorrect claims about image content. We propose a training-free self-correction framework enabling VLMs to iteratively refine responses through uncertainty-guided visual re-attention. Our method combines multidimensional uncertainty quantification (token entropy, attention dispersion, semantic consistency, claim confidence) with attention-guided cropping of under-explored regions. Operating entirely with frozen, pretrained VLMs, our framework requires no gradient updates. We validate our approach on the POPE and MMHAL-BENCH benchmarks using the Qwen2.5-VL-7B [23] architecture. Experimental results demonstrate that our method reduces hallucination rates by 9.8 percentage points compared to the baseline, while improving object existence accuracy by 4.7 points on adversarial splits. Furthermore, qualitative analysis confirms that uncertainty-guided re-attention successfully grounds corrections in visual evidence where standard decoding fails. We plan to extend validation across diverse architectures in future versions. We release our code and methodology to facilitate future research in trustworthy multimodal systems.
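
To make the idea of a unified uncertainty score concrete, the following is a minimal sketch of how four per-claim signals might be folded into one number. It is not the paper's formula: the equal weights, the assumption that every signal is already normalized to [0, 1], the use of inconsistency (one minus semantic consistency) so that higher always means more uncertain, and the names `UncertaintySignals` and `combined_uncertainty` are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class UncertaintySignals:
    """Per-claim signals, each assumed to be pre-normalized to [0, 1]."""
    token_entropy: float           # higher = flatter token distribution
    attention_dispersion: float    # higher = attention spread over the image
    semantic_inconsistency: float  # assumed to be 1 - semantic consistency
    hedging: float                 # higher = more hedged wording in the claim


def combined_uncertainty(
    signals: UncertaintySignals,
    weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),  # assumed equal weights
) -> float:
    """Weighted average of the four signals (a hypothetical combination rule)."""
    values = (
        signals.token_entropy,
        signals.attention_dispersion,
        signals.semantic_inconsistency,
        signals.hedging,
    )
    if len(weights) != len(values):
        raise ValueError("expected one weight per signal")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("signals must lie in [0, 1]")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dividing by the weight sum keeps the score in [0, 1] for any positive weights.
    return sum(w * v for w, v in zip(weights, values)) / total
```

A weighted average is only one plausible choice; a claim would then be flagged for visual re-examination when its score exceeds some threshold chosen on held-out data.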

Paper Structure

This paper contains 51 sections, 10 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overview of our uncertainty-guided self-correction framework. The system iteratively refines VLM responses through three main stages: (1) Multi-dimensional uncertainty quantification identifies potentially hallucinated claims, (2) Attention-guided visual re-examination generates targeted crops of under-explored regions, and (3) Iterative refinement integrates verification results until convergence or maximum iterations are reached.
  • Figure 2: Convergence analysis on POPE-Adversarial split. (a) Uncertainty Decay: Mean uncertainty $u_t$ decreases monotonically from 0.52 (iteration 0) to 0.27 (iteration 3), with diminishing returns after iteration 2. Error bars represent $\pm 1$ standard deviation. (b) Accuracy Improvement: Test accuracy improves from 75.9% to 80.6%, with most gains (2.8 points) in the first iteration. Cumulative convergence rates (percentage of samples reaching $u_t < 0.3$) are overlaid on the right axis.
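
The control flow described in the two captions (refine until uncertainty converges or a maximum number of iterations is reached) can be sketched as below. This is an illustrative skeleton, not the released code: `score` and `refine` are hypothetical callables standing in for the uncertainty estimator and for the crop-and-verify step, and the defaults reuse the 0.3 convergence threshold and the three iterations mentioned in the Figure 2 caption.

```python
from typing import Callable, List, Tuple


def self_correct(
    initial_response: str,
    score: Callable[[str], float],   # hypothetical: response -> uncertainty u_t
    refine: Callable[[str], str],    # hypothetical: re-examine crops, revise response
    threshold: float = 0.3,          # convergence criterion u_t < 0.3 (Figure 2 caption)
    max_iterations: int = 3,         # iterations 1..3 after the initial response
) -> Tuple[str, List[float]]:
    """Iteratively refine a response until its uncertainty drops below the threshold.

    Returns the final response and the uncertainty recorded at each iteration,
    starting with the initial response (iteration 0).
    """
    response = initial_response
    history = [score(response)]
    for _ in range(max_iterations):
        if history[-1] < threshold:
            break  # converged: no further visual re-examination needed
        response = refine(response)
        history.append(score(response))
    return response, history
```

Because the loop exits as soon as the threshold is met, samples that are already confident incur no extra model calls, which is consistent with the observation that most gains arrive in the first iteration.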