Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection

Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li, Jose M. Alvarez

TL;DR

This work tackles hallucinations in multimodal LLMs by introducing GACD, an inference-time method that uses a first-order Taylor expansion to estimate token-level biases in the visual and textual inputs. It then mitigates hallucinations via object-aware visual token grouping and anchor-specific influence-weighted decoding, optionally applying sample-dependent early stopping to maintain grounding in long generations. Across open-ended generation and discriminative tasks, GACD reduces hallucinations, improves grounding, and preserves or enhances informativeness, achieving notable gains on benchmarks such as AMBER, POPE, and LLaVA-QA90 while remaining model-agnostic at inference. The approach improves visual grounding without retraining or external models, offering a practical and scalable solution with broad applicability to vision-language tasks and beyond.

Abstract

Multimodal large language models (MLLMs) achieve strong performance across diverse tasks but remain prone to hallucinations, where outputs are not grounded in visual inputs. This issue can be attributed to two main biases: text-visual bias, the overreliance on prompts and prior outputs, and co-occurrence bias, spurious correlations between frequently paired objects. We propose Gradient-based Influence-Aware Constrained Decoding (GACD), an inference-based method that addresses both biases without auxiliary models and is readily applicable to existing models without finetuning. The core of our approach is bias estimation, which uses first-order Taylor gradients to understand the contribution of individual tokens (visual features and text tokens) to the current output. Based on this analysis, GACD mitigates hallucinations through two components: (1) suppressing spurious visual features correlated with the output objects, and (2) rebalancing cross-modal contributions by strengthening visual features relative to text. Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
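The bias-estimation idea in the abstract can be illustrated with a minimal sketch. It assumes a gradient-times-input reading of the first-order Taylor expansion, i.e. the change in an output logit when token $i$ is removed is approximated by $|\partial f / \partial x_i \cdot x_i|$. The exact formulation used by GACD is not given in this summary, so `toy_logit`, its weights, and the scalar "token features" are hypothetical stand-ins, and gradients are taken by finite differences only to keep the example dependency-free.

```python
import math

def toy_logit(tokens):
    """Hypothetical stand-in for one output logit of an MLLM (not the paper's model)."""
    weights = [0.5, -1.0, 2.0, 0.25]  # assumed, for illustration only
    return math.tanh(sum(w * t for w, t in zip(weights, tokens)))

def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x."""
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads

def token_influence(f, x):
    """First-order Taylor estimate of each token's effect: |df/dx_i * x_i|."""
    return [abs(g * xi) for g, xi in zip(numerical_grad(f, x), x)]

def visual_influence_ratio(influence, visual_idx):
    """Share of the total influence carried by the visual tokens."""
    total = sum(influence)
    if total == 0:
        return 0.0
    return sum(influence[i] for i in visual_idx) / total

# Assumption: the first two entries are visual tokens, the last two are text tokens.
tokens = [0.2, 0.1, 0.3, 0.4]
influence = token_influence(toy_logit, tokens)
ratio = visual_influence_ratio(influence, visual_idx=[0, 1])
print(f"visual influence ratio: {ratio:.3f}")
```

In this toy setting the text tokens carry most of the influence, which is the kind of text-visual imbalance the abstract describes; a real implementation would instead backpropagate through the model with respect to its input embeddings.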

Paper Structure

This paper contains 43 sections, 17 equations, 9 figures, 24 tables.

Figures (9)

  • Figure 1: Overview of our influence-aware constrained decoding framework, which mitigates hallucinations by regulating token-level influence. It reduces text–visual bias by enhancing visual token influence (blue bars) in alignment with the most influential text inputs: prompts (gray) or previous outputs (white). It further mitigates co-occurrence bias through anchor-specific suppression, selectively suppressing visual tokens (green, magenta) anchored to previously emitted nouns.
  • Figure 1: Comparison of prediction confidence with and without GACD. (a) Without GACD, mPLUG-Owl2 exhibits low confidence in hallucinated predictions and near-zero confidence in the initial predictions for 'forks' and 'mug'. (b) With GACD, mPLUG-Owl2's confidence increases alongside the visual influence ratio, effectively mitigating hallucinations.
  • Figure 2: Overview of GACD. The method comprises (i) Object-aware Visual Token Grouping and (ii) Anchor-specific Influence-Weighted Decoding. At step $m$, previously mentioned objects are detected from $\mathbf{y}_{<m}$; visual tokens are partitioned into object-related $\textcolor{darkgreen}{\mathbf{t}^{o}}$ and unrelated $\mathbf{t}^{u}$ via token influence (Sec. \ref{sec:token_In}). Anchor-specific Influence-Weighted Decoding extends contrastive decoding with token influence, explicitly amplifying the influence of $\mathbf{t}^{u}$ to jointly counter text–visual and co-occurrence biases; negative-guidance logits $\mathbf{z}_m^{o}$ are generated from $\{\textcolor{darkgreen}{\mathbf{t}^{o}}, \mathbf{t}^{p}, \mathbf{y}_{<m}\}$ to suppress text tokens and anchor-specific visual cues. Grouping is invoked only for noun prediction (where co-occurrence arises between object pairs); for non-noun prediction, we set $\textcolor{darkgreen}{\mathbf{t}^{o}}=\varnothing$ and uniformly amplify all visual tokens to balance text–visual bias.
  • Figure 2: Influence ratio across predicted tokens in VQA: (left) baseline predictions; (right) predictions with GACD. GACD effectively mitigates the text–visual gap, balancing text–visual bias. (f) The original InternVL2 shows a dominant visual influence ratio at the hallucinated prediction 'knife', indicating a co-occurrence bias that remains unaddressed even with dominant visual influence. (g) GACD successfully eliminates co-occurrence hallucinations, including 'knife'.
  • Figure 3: (a) Visual influence ratios across the POPE dataset, illustrating variation across MLLMs. Our method successfully increases the visual influence ratio when it falls below $50\%$. (b) F1 scores for the AMBER discriminative task using LLaVA-v1.5 are consistently improved by our method, with particularly notable gains in the existence and state categories.
  • ...and 4 more figures
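
The decoding step described in the Figure 2 caption (contrastive decoding with negative-guidance logits) can be sketched as follows. This uses the standard contrastive-decoding combination $(1+\alpha)\,\mathbf{z}^{\text{full}} - \alpha\,\mathbf{z}^{o}$ as an assumption; how GACD actually derives its weights from token influence is not specified in this summary. The `alpha_from_ratio` rule, its `target` and `scale` parameters, and all logit values are hypothetical, chosen only to mirror the caption's remark that visual influence is raised when it falls below $50\%$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def contrastive_logits(z_full, z_neg, alpha):
    """Standard contrastive combination: (1 + alpha) * z_full - alpha * z_neg."""
    return [(1 + alpha) * f - alpha * n for f, n in zip(z_full, z_neg)]

def alpha_from_ratio(visual_ratio, target=0.5, scale=2.0):
    """Hypothetical rule: amplify only when the visual influence ratio is below target."""
    return scale * max(0.0, target - visual_ratio)

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

vocab = ["plate", "knife", "cat"]
z_full = [2.0, 2.2, 0.1]  # assumed logits from the full input
z_neg = [1.0, 2.5, 0.0]   # assumed negative-guidance logits (text + object-related visual tokens)

alpha = alpha_from_ratio(visual_ratio=0.25)
z = contrastive_logits(z_full, z_neg, alpha)
probs = softmax(z)

print("baseline:", vocab[argmax(z_full)])
print("adjusted:", vocab[argmax(z)])
```

With these made-up numbers, the token favored by the negative-guidance branch ('knife') is demoted and the prediction switches to 'plate'. When the visual influence ratio is already at or above the target, `alpha` is zero and the logits are left unchanged.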