Treble Counterfactual VLMs: A Causal Approach to Hallucination

Shawn Li, Jiashu Qu, Yuxiao Zhou, Yuehan Qin, Tiankai Yang, Yue Zhao

TL;DR

This work tackles hallucination in Vision-Language Models by adopting a causal perspective. It designs structural causal graphs to separate correct multi-modal fusion from unintended direct modality influences, and uses counterfactuals to estimate Natural Direct Effects for vision, text, and their interaction. A test-time intervention then dynamically reweights intermediate representations to suppress direct modality biases while preserving fusion-driven reasoning, leading to improved grounding and reduced hallucinations across benchmarks. The approach demonstrates robust gains without sacrificing task performance and offers an interpretable, reproducible framework for enhancing VLM reliability.

Abstract

Vision-Language Models (VLMs) have advanced multi-modal tasks like image captioning, visual question answering, and reasoning. However, they often generate hallucinated outputs inconsistent with the visual context or prompt, limiting reliability in critical applications like autonomous driving and medical imaging. Existing studies link hallucination to statistical biases, language priors, and biased feature learning but lack a structured causal understanding. In this work, we introduce a causal perspective to analyze and mitigate hallucination in VLMs. We hypothesize that hallucination arises from unintended direct influences of either the vision or text modality, bypassing proper multi-modal fusion. To address this, we construct a causal graph for VLMs and employ counterfactual analysis to estimate the Natural Direct Effect (NDE) of vision, text, and their cross-modal interaction on the output. We systematically identify and mitigate these unintended direct effects to ensure that responses are primarily driven by genuine multi-modal fusion. Our approach consists of three steps: (1) designing structural causal graphs to distinguish correct fusion pathways from spurious modality shortcuts, (2) estimating modality-specific and cross-modal NDE using perturbed image representations, hallucinated text embeddings, and degraded visual inputs, and (3) implementing a test-time intervention module to dynamically adjust the model's dependence on each modality. Experimental results demonstrate that our method significantly reduces hallucination while preserving task performance, providing a robust and interpretable framework for improving VLM reliability. To enhance accessibility and reproducibility, our code is publicly available at https://github.com/TREE985/Treble-Counterfactual-VLMs.
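
Step (3) of the abstract, the test-time intervention, can be pictured as a logit-level adjustment. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`direct_effect`, `debias_logits`), the use of a per-token logit difference as a stand-in for a direct effect, the per-modality weights, and all toy numbers are hypothetical. The released repository should be consulted for the actual method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def direct_effect(counterfactual_logits, reference_logits):
    """Assumed proxy for a direct effect: the per-token logit difference
    between a counterfactual forward pass and a reference pass."""
    if len(counterfactual_logits) != len(reference_logits):
        raise ValueError("logit vectors must have the same length")
    return [c - r for c, r in zip(counterfactual_logits, reference_logits)]


def debias_logits(full_logits, effects, weights):
    """Subtract weighted direct-effect estimates from the full-model logits.

    full_logits: logits from the unmodified (image, prompt) pair.
    effects:     dict mapping a pathway name to its per-token estimate.
    weights:     dict mapping a pathway name to a scalar (hypothetical
                 hyperparameters; a missing name counts as weight 0).
    """
    adjusted = list(full_logits)
    for name, effect in effects.items():
        if len(effect) != len(full_logits):
            raise ValueError(f"effect '{name}' has the wrong length")
        w = weights.get(name, 0.0)
        adjusted = [a - w * e for a, e in zip(adjusted, effect)]
    return adjusted


# Toy vocabulary: the language prior pushes "frisbee" although it is not in the image.
vocab = ["dog", "cat", "frisbee"]
full = [2.0, 0.5, 1.8]

# Toy counterfactual pass: the image is degraded, so the text pathway dominates.
text_effect = direct_effect([0.6, 0.5, 2.0], [0.5, 0.5, 0.5])
vision_effect = [0.2, 0.1, 0.0]

adjusted = debias_logits(
    full,
    {"vision": vision_effect, "text": text_effect},
    {"vision": 0.5, "text": 1.0},
)

for token, before, after in zip(vocab, softmax(full), softmax(adjusted)):
    print(f"{token}: {before:.3f} -> {after:.3f}")
```

In this toy setting, the token favored mainly by the text-only pathway loses probability mass after the adjustment, while the token supported by the full image-prompt pair stays on top. How the direct effects are actually estimated and weighted is specific to the paper and is not reproduced here.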

Paper Structure

This paper contains 16 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Case study illustrating the impact of our method on VLM hallucination. The figure compares outputs from the original model and our enhanced approach, highlighting reductions in hallucinated content and improved alignment with the visual context. Our method effectively mitigates incorrect descriptions by refining modality interactions, leading to more accurate and reliable multi-modal reasoning.
  • Figure 2: Causal graphs for single-modal models and Vision-Language Models (VLMs). An ideal VLM generates answers conditioned jointly on the vision and text inputs. However, each input may also exert a direct influence on the output on its own. This direct influence can cause hallucination in VLMs, where the generated answers are inconsistent with the provided visual or textual context. T: text input. V: vision input. A: answer.
  • Figure 3: Overall performance and detailed scores of different methods on the 8 question categories of MMHal-Bench. Our method achieves the best overall performance and significantly outperforms existing methods (VCD, OPERA) in the Attribute and Comparison categories.
  • Figure A: Case study illustrating the impact of our method on VLM hallucination. The figure compares outputs from the original model and our enhanced approach, highlighting reductions in hallucinated content and improved alignment with the visual context. Our method effectively mitigates incorrect textual descriptions by refining modality interactions, leading to more accurate and reliable multi-modal reasoning.

Theorems & Definitions (2)

  • Definition 1: Causal Notations
  • Definition 2: Natural Direct Effects (NDE)
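
For orientation, the standard definition of the Natural Direct Effect from causal mediation analysis is given below. The paper's own notation from Definitions 1 and 2 is not reproduced in this listing, so the symbols are assumptions: $X$ is a treatment with value $x$ and reference value $x^{*}$, $M$ is a mediator, and $Y_{x, m}$ is the outcome under the intervention $X = x$, $M = m$.

```latex
\begin{align}
  \mathrm{TE}  &= \mathbb{E}\left[ Y_{x,\, M_{x}} \right]
                - \mathbb{E}\left[ Y_{x^{*},\, M_{x^{*}}} \right], \\
  \mathrm{NDE} &= \mathbb{E}\left[ Y_{x,\, M_{x^{*}}} \right]
                - \mathbb{E}\left[ Y_{x^{*},\, M_{x^{*}}} \right].
\end{align}
```

The NDE changes the treatment from $x^{*}$ to $x$ while holding the mediator at the value it would take under $x^{*}$, so it isolates the pathway that bypasses the mediator. Read against the abstract, one plausible mapping is that $X$ is a single modality (V or T), $M$ is the fused multi-modal representation, and $Y$ is the answer A; that mapping is an interpretation, not a statement of the paper's definitions.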