Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs

Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Shu-Tao Xia

TL;DR

This paper tackles hallucinations in large vision-language models by introducing CMI-VLD, a decoding strategy guided by conditional mutual information between image $V$ and generated text $Y$ conditioned on instruction $X$. It formulates a bi-level optimization with (i) calibrated text sampling that leverages a vision-conditioned distribution and (ii) a lightweight visual token purifier that refines image tokens using attention cues, all implemented via differentiable Gumbel-Softmax training. Empirical results across multiple LVLMs and benchmarks (CHAIR, POPE, GPT-4o SHR, MME, MMBench) show substantial reductions in object-level and sentence-level hallucinations while maintaining or improving decoding efficiency. The approach emphasizes explicit grounding by maximizing cross-modal dependency, offering a principled, scalable path to more reliable LVLMs in real-world settings.

Abstract

Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
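To make the idea of C-PMI calibration concrete, the following is a minimal sketch and not the paper's exact formulation. It assumes the standard definition of conditional pointwise mutual information, $\log p(y \mid v, x) - \log p(y \mid x)$, and adds it to the image-conditioned log-probability of each candidate token. The function names and the `weight` parameter are hypothetical and chosen only for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cpmi_calibrated_scores(logits_with_image, logits_text_only, weight=1.0):
    # Hypothetical scoring rule (an assumption, not taken from the paper):
    #   score(y) = log p(y | v, x) + weight * [log p(y | v, x) - log p(y | x)]
    # The bracketed term is the conditional pointwise mutual information
    # between candidate token y and image v given instruction x.
    lp_vx = log_softmax(logits_with_image)
    lp_x = log_softmax(logits_text_only)
    return [a + weight * (a - b) for a, b in zip(lp_vx, lp_x)]

def greedy_pick(scores):
    # Index of the highest-scoring candidate token.
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vocabulary of three tokens. Token 0 is strongly favored by the
# language prior alone; token 1 gains most of its support from the image.
with_image = [2.0, 1.9, 0.0]
text_only = [3.0, 0.5, 0.0]

plain_choice = greedy_pick(cpmi_calibrated_scores(with_image, text_only, weight=0.0))
calibrated_choice = greedy_pick(cpmi_calibrated_scores(with_image, text_only, weight=1.0))
```

In this toy case the uncalibrated choice is token 0, while the calibrated choice shifts to token 1, the candidate whose probability depends most on the image. In practice, the two logit vectors would come from forward passes with and without the visual input.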

Paper Structure

This paper contains 22 sections, 12 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: (a) Illustration of the attention bias of LVLMs. While image tokens constitute the majority of the input tokens, they receive significantly less cumulative attention scores compared to text tokens. (b) The proposed purification mechanism when the masking ratio is 50%. Our method promotes more reliable generation by adaptively retaining image tokens with high relevance to the ongoing response.
  • Figure 2: Overview of the proposed CMI-VLD decoding. At each timestep $t$, CMI-VLD mitigates hallucination by maximizing mutual dependency between the visual input and the ongoing response through the proposed vision-language purification. Specifically, the visual token purifier first incorporates current input tokens to predict an image mask $\mathcal{M}_v$, which filters out irrelevant visual tokens to enhance C-PMI. Based on the refined visual input, a text token distribution is correspondingly constructed to penalize hallucination-related text tokens and hence guide the next-token sampling to further strengthen the dependency on the visual input.
  • Figure 3: GPT-4o assisted benchmark. We calculate the Sentence-level Hallucination Ratio (SHR) as the major metric for hallucination degree, along with 1&2-gram, the number of sentences per image (SPI), and the number of words per image (WPI). A larger radar area indicates better performance.
  • Figure 4: Generation time per sample of different methods.
  • Figure 5: CHAIR$_S$ results of the proposed CMI-VLD under varying values of hyperparameters $\alpha$ and $\lambda$.
  • ...and 5 more figures
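The masking step described in the captions of Figures 1(b) and 2 can be sketched as follows. This is an illustrative stand-in, assuming each image token already has a scalar relevance score; the paper's purifier predicts the mask $\mathcal{M}_v$ with a learned module, whereas this hypothetical helper `topk_keep_mask` simply keeps the highest-scoring tokens and is not differentiable.

```python
def topk_keep_mask(relevance, mask_ratio=0.5):
    # Hypothetical helper: build a binary mask over image tokens that
    # drops the `mask_ratio` fraction with the lowest relevance scores.
    # 1 = keep the token, 0 = mask it out.
    n = len(relevance)
    n_keep = n - int(n * mask_ratio)
    # Indices sorted from most to least relevant.
    order = sorted(range(n), key=lambda i: relevance[i], reverse=True)
    keep = set(order[:n_keep])
    return [1 if i in keep else 0 for i in range(n)]

# Four image tokens with made-up relevance scores and a 50% masking ratio.
scores = [0.1, 0.9, 0.4, 0.7]
mask = topk_keep_mask(scores, mask_ratio=0.5)  # [0, 1, 0, 1]
```

A hard top-k selection like this cannot be trained by backpropagation, which is presumably why the TL;DR mentions Gumbel-Softmax: it provides a differentiable relaxation of discrete keep/drop decisions.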