Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

Fushuo Huo; Wenchao Xu; Zhong Zhang; Haozhao Wang; Zhicheng Chen; Peilin Zhao

TL;DR

This work tackles LVLM hallucinations by proposing Self-Introspective Decoding (SID), a training-free decoding strategy that uses Context and Text-aware Token Selection (CT$^2$S) to adaptively prune vision tokens in early decoder layers based on self-attention, thereby inducing targeted vision-text hallucinations. These amplified hallucinations are then subtracted from the original logits to guide faithful next-token predictions. SID rethinks contrastive decoding by deriving the vision-text disturbance from context- and text-informed token selection rather than from holistic input perturbations, achieving lower hallucination rates and higher-quality text without extra knowledge or heavy computation. Extensive experiments across multiple LVLMs and benchmarks (CHAIR, POPE, GPT-4V SHR, MMBench, MME) demonstrate SID’s effectiveness and efficiency, highlighting its practical potential for trustworthy multimodal generation.

Abstract

While Large Vision-Language Models (LVLMs) have advanced rapidly in recent years, the prevalent 'hallucination' problem has emerged as a significant bottleneck that hinders their real-world deployment. Existing methods mitigate this issue mainly from two perspectives. One approach leverages extra knowledge, such as robustly instruction-tuning LVLMs on curated datasets or employing auxiliary analysis networks, which inevitably incurs additional costs. Another approach, known as contrastive decoding, induces hallucinations by manually disturbing the raw vision or instruction inputs and mitigates them by contrasting the outputs of the disturbed and original LVLMs. However, these approaches rely on empirical, holistic input disturbances and double the inference cost. To avoid these issues, we propose a simple yet effective method named Self-Introspective Decoding (SID). Our empirical investigation reveals that pretrained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. We develop the Context and Text-aware Token Selection (CT$^2$S) strategy, which preserves only the unimportant vision tokens after the early layers of LVLMs to adaptively amplify text-informed hallucinations during auto-regressive decoding. This approach ensures that the multimodal knowledge absorbed in the early layers induces multimodal contextual hallucinations rather than aimless ones. Subsequently, the amplified vision-and-text association hallucinations are subtracted from the original token logits, guiding LVLMs to decode faithfully. Extensive experiments show that SID generates less hallucinated and higher-quality text across various metrics, without extra knowledge and without much additional computational burden.
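The two steps described in the abstract can be illustrated with a toy sketch in plain Python. It is not the authors' implementation. The function names, the toy scores and logits, and the `(1 + alpha) * original - alpha * disturbed` contrast form (common in contrastive decoding) are assumptions made for illustration. The abstract itself only states that the least important vision tokens are kept and that the amplified hallucination logits are subtracted from the original ones.

```python
import math


def select_least_important(scores, keep):
    """Return the indices of the `keep` vision tokens with the lowest scores.

    `scores` stands in for per-token importance values (hypothetical here;
    the source derives importance from self-attention in early layers).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:keep])


def contrastive_logits(original, disturbed, alpha=1.0):
    """Contrast original logits against hallucination-amplified logits.

    Assumed form: (1 + alpha) * original - alpha * disturbed.
    """
    return [(1 + alpha) * o - alpha * d for o, d in zip(original, disturbed)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy importance scores for five vision tokens (made-up numbers).
vision_scores = [0.40, 0.05, 0.30, 0.02, 0.23]
kept = select_least_important(vision_scores, keep=2)

# Toy next-token logits over a three-word vocabulary (made-up numbers).
# Index 1 plays the role of a hallucinated word that the disturbed pass,
# which sees only the unimportant vision tokens, also scores highly.
original = [2.0, 2.1, 0.1]
disturbed = [0.5, 2.0, 0.0]

adjusted = contrastive_logits(original, disturbed, alpha=1.0)
probs = softmax(adjusted)
best = max(range(len(probs)), key=lambda i: probs[i])
```

In this toy case the original logits slightly favour index 1, while the contrasted logits favour index 0, because the disturbed pass assigns index 1 a high score that is then subtracted out.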

Paper Structure (22 sections, 7 equations, 16 figures, 15 tables)

Introduction
Related Work
Preliminary and Motivation
Paradigm of LVLMs Generation
Re-thinking Contrastive Decoding in LVLMs
Methodology
Understanding the Self-Introspective Pre-trained LVLMs
Context and Text-aware Token Selection (CT$^2$S) Strategy
Experiments
Experimental Settings
Evaluation Results
Ablation Analyses
Conclusion and Future Work
Acknowledgements
Appendix
...and 7 more sections

Figures (16)

Figure 1: Contrastive Decoding strategies: (a) Visual Contrastive Decoding (VCD) manually distorts vision inputs. (b) Instruction Contrastive Decoding (ICD) also manually designs a noisy instruction (negative prompt). Detailed analyses are in Sec. 3.2. We omit other modules like the vision encoder and tokenizer for clarity. t: 'Please describe this image in detail.'; sys.: system prompt. $g$: generated text tokens. $\alpha$ in Eq. 2 defaults to 1.
Figure 2: Overview of Self-Introspective Decoding (SID). CT$^2$S: Context and Text-aware Token Selection strategy. LLaVA-1.5 7B is utilized as an example to visualize visual tokens with low and high scores (Eq. 5).
Figure 3: Visualization Results of the least important vision tokens on discrimination tasks informed by preceding vision and text tokens. LLaVA-1.5 7B with Layer $i=3$ is utilized.
Figure 4: Visualization Results of Adaptively Selecting the least important vision tokens on open-ended generative tasks informed by preceding vision and text tokens. LLaVA-1.5 7B with Layer $i=3$ is utilized.
Figure 5: Instance Illustration of Different Disturbance Results. Examples are from MSCOCO inferred by LLaVA-1.5 7B with $i=3$ and Top-k=50. Hallucinations are marked in red.
...and 11 more figures
