Intervene-All-Paths: Unified Mitigation of LVLM Hallucinations across Alignment Formats

Jiaye Qian, Ge Zheng, Yuchen Zhu, Sibei Yang

TL;DR

This paper addresses LVLM hallucinations by proposing AllPath, a unified intervention framework aligned with transformer causal structure to mitigate hallucinations across image-to-text and text-to-text pathways. It introduces fast, head-centric probing methods that identify crucial text-to-text and image-to-text attention heads and demonstrates that LVLMs adaptively select different causal pathways based on question format. By applying adaptive interventions on the identified heads with pathway-aware scaling, AllPath achieves consistent improvements across POPE, MCQ-POPE, CHAIR, and MME benchmarks while maintaining efficiency. The findings reveal nuanced internal mechanisms of LVLMs, showing that multi-path consideration and adaptive pathway engagement are essential for reliable multimodal reasoning in real-world tasks.

Abstract

Despite their impressive performance across a wide range of tasks, Large Vision-Language Models (LVLMs) remain prone to hallucination. In this study, we propose a comprehensive intervention framework aligned with the transformer's causal architecture in LVLMs, integrating the effects of different intervention paths on hallucination. We find that hallucinations in LVLMs do not arise from a single causal path, but rather from the interplay among image-to-input-text, image-to-output-text, and text-to-text pathways. For the first time, we also find that LVLMs rely on different pathways depending on the question-answer alignment format. Building on these insights, we propose simple yet effective methods to identify and intervene on critical hallucination heads within each pathway, tailored to discriminative and generative formats. Experiments across multiple benchmarks demonstrate that our approach consistently reduces hallucinations across diverse alignment types.

Paper Structure

This paper contains 33 sections, 12 equations, 13 figures, 7 tables, 2 algorithms.

Figures (13)

  • Figure 1: Left: Our AllPath intervention framework comprehensively mitigates hallucinations from image-to-input-text, image-to-output-text, and text-to-text paths. Right: AllPath achieves significant performance improvements over the baselines across all benchmarks.
  • Figure 2: Overview of our proposed AllPath. AllPath first identifies the most critical text-to-text and image-to-text heads contributing to hallucinations using the Log Probability Increase (LPI) score and the ratio of key object attention to total image attention. Then, by applying adaptive head interventions, AllPath mitigates hallucinations by manipulating the causal pathways in LVLMs.
  • Figure 3: Visualization of the rank distributions of text-to-text heads extracted from different datasets. $\rho$ denotes the correlation coefficient; a higher value (i.e., points concentrated along the diagonal) indicates greater similarity between the two sets of heads. Heads identified from datasets with similar question–answer alignment formats exhibit strong correlation and substantial overlap.
  • Figure 4: Left: Visualization of the rank distributions of image-to-input-text heads and image-to-output-text heads, showing that the two are largely uncorrelated. Right: Compared to the average of all heads, the image-to-text heads we identified exhibit a stronger focus on visual content.
  • Figure 5: Left: When we entirely knock out the attention weights of output tokens attending to image tokens, POPE performance degrades very little, whereas CHAIR performance degrades significantly. Right: This indicates that LVLMs utilize different pathways for different question formats.
  • ...and 8 more figures
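
The Figure 2 caption mentions scoring image-to-text heads by the ratio of key object attention to total image attention. The following is a minimal sketch of one way such a ratio could be computed, not the authors' implementation. The function names, the tensor layout (heads × queries × keys), and the choice to sum over a set of query positions are all assumptions made for illustration; the paper may define the score differently (for example, per layer, per token, or with a different normalization).

```python
import numpy as np

def key_object_attention_ratio(attn, image_idx, key_object_idx, query_idx):
    """Per-head share of image attention that lands on key-object tokens.

    attn           : array of shape (num_heads, seq_len, seq_len), where
                     attn[h, q, k] is the weight query q puts on key k
                     (assumed layout, hypothetical).
    image_idx      : positions of all image tokens.
    key_object_idx : positions of image tokens covering the key object
                     (a subset of image_idx).
    query_idx      : positions of the text tokens used as queries.
    """
    rows = attn[:, query_idx, :]                        # (heads, queries, seq_len)
    total = rows[:, :, image_idx].sum(axis=(1, 2))      # attention on all image tokens
    key = rows[:, :, key_object_idx].sum(axis=(1, 2))   # attention on key-object tokens
    # Guard against heads that place no attention on the image at all.
    return key / np.maximum(total, 1e-12)

def rank_heads(scores):
    """Return head indices sorted from highest to lowest score."""
    return np.argsort(-scores)
```

In this sketch, a head that spreads its image attention uniformly over four image tokens, two of which cover the key object, scores 0.5, while a head that attends only to a key-object token scores 1.0 and is ranked first.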