V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention

Nan Sun, Zhenyu Zhang, Xixun Lin, Kun Wang, Yanmin Shang, Naibin Gu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, Yanan Cao

TL;DR

V-ITI targets hallucinations in multimodal LLMs by addressing the timing of interventions. It introduces a Visual Neglect Detector to decide when to intervene and a Visual Recall Intervenor to guide how to intervene, using stored visual activations to reinforce grounding only when neglect is detected. Theoretical analysis via mutual information supports the method's grounding mechanism. Empirically, V-ITI reduces vision-related hallucinations across eight benchmarks and maintains or improves general multimodal performance, with favorable latency and memory overhead compared to competing approaches.

Abstract

Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, that is, content inconsistent with the input visuals, which undermine reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention scores or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene". This leads to the "over-intervention" problem, which in turn introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal that it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.

Paper Structure

This paper contains 18 sections, 1 theorem, 12 equations, 8 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let $\boldsymbol{o}_l^h$ denote the original head-wise activation and $\boldsymbol{\hat{o}}_l^h$ denote the modulated activation after intervention. Let $\boldsymbol{X}[v_s:v_e] \subseteq \boldsymbol{X}_l$ be the visual subset of input tokens at the $l$-th level. The mutual information (MI) between them satisfies

Figures (8)

  • Figure 1: Illustration of the phenomenon of "over-intervention". For logits intervention (upper), contrasting logits from perturbed and unperturbed visual inputs suppresses the correct answer’s logits, which forces the model to answer "White" instead of the correct color "Yellow". For attention intervention (lower), overly enhancing visual attention induces new quantity hallucinations: the model, which initially answered "One" correctly, mistakenly answers "Two" due to excessive attention.
  • Figure 2: As Gaussian perturbations increase, the model’s attention to visual tokens diminishes, and the attention heatmap reveals a loss of focus on question-relevant regions. This visual neglect weakens the model’s ability to perceive visual evidence, ultimately leading to a decline in accuracy and F1 score on the POPE dataset.
  • Figure 3: Illustration of the overall V-ITI architecture. To avoid over-intervention in hallucination mitigation, we propose two modules. The Visual Neglect Detector (VND) determines when intervention is needed by discriminating head-level activation patterns, while the Visual Recall Intervenor (VRI) addresses how to intervene by integrating the original head output with retained visual activations.
  • Figure 4: Sorted probe accuracies on the validation set for all attention heads across all layers in the LLaVA-1.5 model.
  • Figure 5: Efficiency comparison of V-ITI against baseline methods. V-ITI achieves near-greedy latency with minimal memory overhead. Logits-intervention methods (VCD, ICD) average 2.19x the latency of greedy decoding, while attention-intervention methods (OPERA, INTER) average 3.46x.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Theorem 1
  • proof