Causally-Grounded Dual-Path Attention Intervention for Object Hallucination Mitigation in LVLMs

Liu Yu, Zhonghao Chen, Ping Kuang, Zhikun Feng, Fan Zhou, Lan Wang, Gillian Dobbie

TL;DR

The paper tackles object hallucination in LVLMs by modeling visual and textual attention as mediators within a structural causal model. It introduces VTACR to quantify cross-modal contribution and uses VTACR signals to perform token- and layer-wise attention interventions, complemented by a dual-path contrastive decoding strategy that separates faithful from hallucinated outputs. Empirical results on CHAIR and POPE benchmarks show substantial hallucination reductions with preserved or improved vision-language understanding, achieving state-of-the-art faithfulness. The approach provides a principled, causality-driven framework for multimodal generation and offers practical code for replication.

Abstract

Object hallucination remains a critical challenge in Large Vision-Language Models (LVLMs), where models generate content inconsistent with visual inputs. Existing language-decoder-based mitigation approaches often regulate visual or textual attention independently, overlooking their interaction as two key causal factors. To address this, we propose Owl (Bi-mOdal attention reWeighting for Layer-wise hallucination mitigation), a causally-grounded framework that models the hallucination process via a structural causal graph, treating the decomposed visual and textual attentions as mediators. We introduce VTACR (Visual-to-Textual Attention Contribution Ratio), a novel metric that quantifies the imbalance between modality contributions during decoding. Our analysis reveals that hallucinations frequently occur in low-VTACR scenarios, where textual priors dominate and visual grounding is weakened. To mitigate this, we design a fine-grained attention intervention mechanism that dynamically adjusts token- and layer-wise attention guided by VTACR signals. Finally, we propose a dual-path contrastive decoding strategy: one path emphasizes visually grounded predictions, while the other amplifies hallucinated ones, letting visual truth shine and hallucination collapse. Experimental results on the POPE and CHAIR benchmarks show that Owl achieves significant hallucination reduction, setting a new SOTA in faithfulness while preserving vision-language understanding capability. Our code is available at https://github.com/CikZ2023/OWL
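To make the idea of a visual-to-textual attention ratio concrete, the following is a minimal sketch, not the paper's definition. It assumes VTACR can be approximated as the attention mass a generated token places on visual tokens divided by the mass it places on textual tokens, for a single layer with heads already averaged. The function name `vtacr`, the index lists, and the `eps` stabilizer are all hypothetical; the exact formula used by Owl is not given in this summary.

```python
from typing import Sequence


def vtacr(
    attn_row: Sequence[float],
    visual_idx: Sequence[int],
    text_idx: Sequence[int],
    eps: float = 1e-8,
) -> float:
    """Hypothetical VTACR: visual attention mass over textual attention mass.

    attn_row   -- attention weights of one generated token over the context
    visual_idx -- positions of image tokens in the context (assumed known)
    text_idx   -- positions of text tokens in the context (assumed known)
    eps        -- small constant to avoid division by zero (an assumption)
    """
    visual_mass = sum(attn_row[i] for i in visual_idx)
    text_mass = sum(attn_row[i] for i in text_idx)
    return visual_mass / (text_mass + eps)


# Toy attention row over 6 context positions: 3 image tokens, then 3 text tokens.
attn = [0.05, 0.05, 0.10, 0.30, 0.30, 0.20]
ratio = vtacr(attn, visual_idx=[0, 1, 2], text_idx=[3, 4, 5])
print(f"VTACR (toy example): {ratio:.3f}")
```

In this toy row most of the attention falls on text tokens, so the ratio is well below 1. Under the abstract's analysis, such low-VTACR steps are the ones where textual priors dominate and hallucination is more likely.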

Paper Structure

This paper contains 14 sections, 7 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Motivation of our work. (a) Existing methods manipulate attention in a single modality (visual or text). (b) We contrast the visual-favored path and text-favored path based on the VTACR-guided attention calibration. (c) Increasing visual attention improves causal effect but shortens output, while increasing textual attention has the opposite impact. (d) Hallucinated tokens typically show lower VTACR, indicating a skewed visual-to-textual modality reliance.
  • Figure 2: The SCM for analyzing the hallucination process. Visual input ($X_V$) and text input ($X_T$) affect the output ($Y_T$) via visual attention ($A_V$) and text attention ($A_T$). Visual priors ($P_V$) and language priors ($P_T$) confound the attention paths and may cause hallucinations. Interventions on $A_V$ and $A_T$ help estimate their causal impact.
  • Figure 3: The overall framework of Owl. Given image, text, and generation history, Owl performs layer-wise decomposition of visual, textual, and historical attentions. Based on the VTACR distribution, Owl adaptively modulates attention along: a visual-favored path (enhancing grounding) and a text-favored path (amplifying hallucination). A dual-path contrastive decoding strategy then drives the LVLM to suppress hallucinations (e.g., Football) while preserving truthful predictions.
  • Figure 4: Comparison among different VLMs on five VQA benchmarks and three common benchmarks. The highest-performing results are highlighted in boldface.
  • Figure 5: Impact of $\alpha$, $\beta$, and $\lambda$ on hallucination and informativeness in LLaVA-1.5, evaluated on $500$ COCO samples.
  • ...and 3 more figures
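The dual-path contrastive decoding described in the Figure 3 caption can be illustrated with a small sketch. It assumes the widely used contrastive-decoding form, in which logits from the text-favored (hallucination-amplifying) path are subtracted from logits of the visual-favored path. The combination rule, the weight `lam`, and the toy logits are assumptions for illustration only; this summary does not state how Owl combines the two paths or what role $\lambda$ plays in the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_logits(
    visual_path: List[float], text_path: List[float], lam: float = 1.0
) -> List[float]:
    """Assumed contrastive rule: (1 + lam) * visual-favored - lam * text-favored."""
    return [(1.0 + lam) * v - lam * t for v, t in zip(visual_path, text_path)]


# Toy vocabulary; "football" plays the role of the hallucinated object.
vocab = ["dog", "grass", "football"]
visual_path = [2.0, 1.5, 0.5]  # hypothetical logits with visual attention enhanced
text_path = [1.0, 1.0, 2.5]    # hypothetical logits with textual attention enhanced

combined = contrast_logits(visual_path, text_path, lam=1.0)
probs = softmax(combined)
best = vocab[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

Because the text-favored path scores the hallucinated token highest, subtracting it pushes that token's combined logit down while tokens supported by the visual-favored path keep their lead.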