Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression

Hamidreza Dastmalchi; Aijun An; Ali Cheraghian; Hamed Barzamini

TL;DR

Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness.

Abstract

While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations -- unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/.
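The two phases described in the abstract can be written compactly as follows. This is a sketch under stated assumptions, not the paper's exact formulation: the symbols $\boldsymbol{D}_\ell$, $\boldsymbol{V}_{\ell,r}$, and $\boldsymbol{h}'_\ell$ are introduced here for illustration, and the projection is assumed to be a plain orthogonal projection with no additional scaling or layer weighting.

```latex
% Offline phase: stack the M feature differences at layer \ell (rows of D_\ell)
% and take a truncated SVD. D_\ell and V_{\ell,r} are illustrative names.
\boldsymbol{D}_\ell =
\begin{bmatrix}
  \boldsymbol{\delta}_\ell^{(1)\top} \\ \vdots \\ \boldsymbol{\delta}_\ell^{(M)\top}
\end{bmatrix}
\in \mathbb{R}^{M \times d},
\qquad
\boldsymbol{\delta}_\ell^{(i)} = \tilde{\boldsymbol{h}}_\ell^{(i)} - \boldsymbol{h}_\ell^{(i)},
\qquad
\boldsymbol{D}_\ell = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top .

% Keep the top-r right-singular vectors as an orthonormal basis of the
% (assumed) hallucination subspace.
\boldsymbol{V}_{\ell,r} = \boldsymbol{V}_{[:,\,1:r]} \in \mathbb{R}^{d \times r},
\qquad
\boldsymbol{V}_{\ell,r}^\top \boldsymbol{V}_{\ell,r} = \boldsymbol{I}_r .

% Inference phase: project a hidden state onto the orthogonal complement.
\boldsymbol{h}'_\ell
  = \left(\boldsymbol{I}_d - \boldsymbol{V}_{\ell,r}\boldsymbol{V}_{\ell,r}^\top\right)\boldsymbol{h}_\ell ,
\qquad
\boldsymbol{V}_{\ell,r}^\top \boldsymbol{h}'_\ell = \boldsymbol{0} .
```

Under these assumptions the corrected state $\boldsymbol{h}'_\ell$ has no component along any retained direction, while components orthogonal to the subspace are left unchanged.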

Paper Structure (25 sections, 12 equations, 15 figures, 8 tables)

Introduction
Related Work
Large Vision-Language Models (LVLMs)
Hallucination Mitigation in LVLMs
Test-time Hallucination Suppression in LVLMs
Method
Offline Phase
Inference Phase
Experiments
Implementation Details
Datasets and Benchmarks
Baselines
Results on CHAIR
Results on OPOPE
MMHal Benchmark Results
...and 10 more sections

Figures (15)

Figure 1: Given an image--caption pair $(\boldsymbol{I}, \mathcal{C})$, we generate a counterfactual image $\tilde{\boldsymbol{I}}$ using a diffusion model conditioned on a GPT-perturbed caption $\tilde{\mathcal{C}}$ via controlled perturbations. A vision--language model encodes both $(\boldsymbol{I}, \mathcal{C})$ and $(\tilde{\boldsymbol{I}}, \mathcal{C})$, yielding features $\boldsymbol{h}$ and $\tilde{\boldsymbol{h}}$. Their difference $\boldsymbol{\delta} = \tilde{\boldsymbol{h}} - \boldsymbol{h}$ captures the hallucination direction we later nullify at inference.
Figure 2: (a) Hallucinated image generation: given an image $\boldsymbol{I}_i$ and its ground-truth caption $\mathcal{C}_i$, a GPT model generates a hallucinated caption $\tilde{\mathcal{C}}_i$. The image is then encoded by the encoder $\mathcal{E}$ of the Stable Diffusion Model (SDM), and both forward and reverse diffusion steps are applied, conditioned on $\tilde{\mathcal{C}}_i$, to produce hallucinated images $\tilde{\boldsymbol{I}}_{i,j}$. (b) Estimating hallucination subspace: the LVLM encodes both hallucinated and ground-truth image--caption pairs to extract hidden states. Feature differences $\{ \boldsymbol{\delta}_\ell^{(i)} \}_{i=1}^M$ are computed, stacked, and decomposed via SVD. The top $r$ right-singular vectors are retained in a hallucination basis bank for inference-time suppression.
Figure 3: Inference-time projection mechanism. Given an input image and instruction, the model generates text autoregressively. At each decoding step during generation, hidden states from selected layers are projected onto the subspace orthogonal to the corresponding hallucination space, using the hallucination basis bank obtained in the offline phase.
Figure 4: Radar chart of MMHal scores.
Figure 5: LLaVA-Bench example: CIPHER reduces hallucinations and improves grounding.
...and 10 more figures
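
The subspace estimation in Figure 2(b) and the projection in Figure 3 can be sketched in NumPy as below. This is a minimal illustration on synthetic data, not the authors' implementation: the function names `estimate_hallucination_basis` and `project_out`, the planted directions, and all sizes are hypothetical, and real hidden states would come from an LVLM rather than a random generator.

```python
import numpy as np


def estimate_hallucination_basis(h_real, h_cf, r):
    """Return a (d, r) orthonormal basis from paired hidden states.

    h_real: (M, d) features from authentic (image, caption) pairs.
    h_cf:   (M, d) features from counterfactual (edited image, caption) pairs.
    Hypothetical helper; the layer choice and r are assumptions.
    """
    deltas = h_cf - h_real  # (M, d) stacked feature differences
    # Rows of vt are right-singular vectors, sorted by singular value.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    return vt[:r].T  # top-r right-singular vectors as columns


def project_out(h, basis):
    """Project h (shape (d,) or (N, d)) onto the orthogonal complement of basis."""
    return h - (h @ basis) @ basis.T


# Synthetic check: plant r directions and see whether they are recovered.
rng = np.random.default_rng(0)
M, d, r = 200, 32, 2
planted = np.linalg.qr(rng.normal(size=(d, r)))[0]  # (d, r) orthonormal
h_real = rng.normal(size=(M, d))
coeffs = 5.0 * rng.normal(size=(M, r))
h_cf = h_real + coeffs @ planted.T + 0.01 * rng.normal(size=(M, d))

basis = estimate_hallucination_basis(h_real, h_cf, r)
h_new = project_out(h_cf[0], basis)

# Residual of the planted directions after projecting onto the estimated basis.
recovery_error = np.linalg.norm(planted - basis @ (basis.T @ planted))
```

In this toy setup the projected state has no component along the estimated basis, and the planted directions lie almost entirely inside the estimated subspace. In an actual pipeline one basis would be stored per selected layer, matching the "hallucination basis bank" in the captions, and the projection would be applied at each decoding step.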
