VASparse: Towards Efficient Visual Hallucination Mitigation via Visual-Aware Token Sparsification

Xianwei Zhuang, Zhihong Zhu, Yuxin Xie, Liming Liang, Yuexian Zou

TL;DR

VASparse is a fast, plug-and-play decoding framework that mitigates visual hallucinations (VH) in LVLMs through visual-aware token sparsification. It combines three components: a theoretically grounded token selection mechanism, a sparse-based visual contrastive decoding step that uses embeddings to avoid a second decoding pass, and a penalty on sinking attention that keeps language priors from overshadowing visual content. The method achieves state-of-the-art VH mitigation on the CHAIR, POPE, MME, and GPT-4-assisted benchmarks while keeping decoding speed competitive, and it requires no additional training or post-processing. These results point to practical, scalable improvements for reliable multimodal generation in real-world LVLM deployments.

Abstract

Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, known as visual hallucinations (VH), which significantly impede their real-world usage. To alleviate VH, various decoding strategies have been proposed to enhance visual information. However, many of these methods require secondary decoding and rollback, which significantly reduces inference speed. In this work, we propose Visual-Aware Sparsification (VASparse), an efficient plug-and-play decoding algorithm that mitigates VH from the perspective of token sparsity. VASparse is inspired by two empirical observations: (1) attention in LVLMs is sparsely activated, and (2) visual-agnostic token sparsification exacerbates VH. Based on these insights, we propose a novel token sparsification strategy that balances efficiency and trustworthiness. Specifically, VASparse implements a visual-aware token selection strategy during decoding to reduce redundant tokens while effectively preserving visual context. Additionally, we introduce a sparse-based visual contrastive decoding method to recalibrate the distribution of hallucinated outputs without the time overhead of secondary decoding. Finally, VASparse recalibrates attention scores to penalize the attention sinking of LVLMs towards text tokens. Extensive experiments across four popular benchmarks confirm the effectiveness of VASparse in mitigating VH across different LVLM families without requiring additional training or post-processing. VASparse achieves state-of-the-art performance in mitigating VH while maintaining competitive decoding speed. Code is available at https://github.com/mengchuang123/VASparse-github.
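
The contrastive recalibration mentioned in the abstract can be illustrated with a minimal sketch. It shows only the generic form used by contrastive decoding methods, where logits from a trusted view are amplified relative to logits from a degraded view. It is not the VASparse formulation, which is given in the paper and not reproduced here. The function names, the weight `alpha`, and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(trusted_logits, degraded_logits, alpha=1.0):
    """Generic contrastive recalibration (assumed form, not the paper's):
    (1 + alpha) * trusted - alpha * degraded, element-wise."""
    return [
        (1 + alpha) * t - alpha * d
        for t, d in zip(trusted_logits, degraded_logits)
    ]


# Toy vocabulary and hypothetical next-token logits.
vocab = ["dog", "cat", "frisbee"]
trusted_logits = [2.0, 1.0, 1.5]   # view that keeps the visual context
degraded_logits = [2.0, 1.0, 0.5]  # view that lost evidence for "frisbee"

adjusted = contrastive_logits(trusted_logits, degraded_logits, alpha=1.0)
probs = softmax(adjusted)
best_token = vocab[probs.index(max(probs))]
print(adjusted, best_token)
```

In this toy case the token whose score depends most on visual evidence gains the most after recalibration, which is the intuition behind contrasting a visually grounded distribution with a visually impoverished one.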

Paper Structure

This paper contains 26 sections, 1 theorem, 19 equations, 11 figures, and 8 tables.

Key Result

Theorem 1

(Global Optimality): By employing the visual-aware token selection strategy, we can obtain a globally optimal solution for the optimization problem defined in Definition 1. Specifically, the sparse mask $M$ derived from this selection strategy satisfies:

Figures (11)

  • Figure 1: Comparison of decoding speed and hallucination mitigation across methods using LLaVA-1.5 (Liu et al., 2023) with a maximum of 64 new tokens. A lower instance-level CHAIR score (Rohrbach et al., 2018) indicates less hallucination, and a higher TPS (tokens generated per second during decoding) reflects greater decoding efficiency. Results are averaged over five runs on a single A100 GPU. Our approach achieves both lower VH and higher efficiency.
  • Figure 2: VH evaluation and attention analysis using LLaVA-1.5 on the CHAIR benchmark: (a) tokens sorted by attention score; (b) token sparsification effects observed with Vanilla Top-K, FastV (Chen et al., 2024), and SparseVLM (Zhang et al., 2024) on 500 images sampled from the MSCOCO validation set, where Vanilla Top-K keeps the tokens with the top-K scores in the first layer; and (c) attention density distribution across various tokens.
  • Figure 3: Attention sinking phenomenon in LVLMs: the 26th attention head in the 8th layer of LLaVA-1.5 exhibits a substantial concentration of attention on specific tokens, e.g., <.> and <s>.
  • Figure 4: Illustration of the proposed VASparse framework, which consists of (1) visual-aware token selection, designed to prune the generated tokens during decoding; (2) a sparse-based visual contrastive decoding method to recalibrate the distribution of hallucinated outputs; and (3) a calibration strategy for penalizing sinking attention.
  • Figure 5: Performance and efficiency analysis of different logit sources: (a) the impact of different early-stopping layers on LLaVA-1.5 performance; (b) the impact of different early-stopping layers on decoding speed (TPS).
  • ...and 6 more figures
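
Three quantities named in the captions above can be sketched in a few lines: the Vanilla Top-K baseline from Figure 2 (keep the tokens with the K highest attention scores), and the two axes of Figure 1 (instance-level CHAIR and TPS). This is a minimal illustration under stated assumptions, not the authors' evaluation code. The function names, the toy attention scores, and the toy object lists are hypothetical. Instance-level CHAIR is computed here as the fraction of mentioned objects that are absent from the ground-truth object set.

```python
def top_k_token_mask(attention_scores, k):
    """Return a 0/1 mask that keeps the k tokens with the highest scores."""
    n = len(attention_scores)
    k = max(0, min(k, n))  # clamp k to a valid range
    ranked = sorted(range(n), key=lambda i: attention_scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(n)]


def chair_instance(mentioned_objects, ground_truth_objects):
    """Fraction of mentioned objects not present in the ground truth."""
    if not mentioned_objects:
        return 0.0
    truth = set(ground_truth_objects)
    hallucinated = [obj for obj in mentioned_objects if obj not in truth]
    return len(hallucinated) / len(mentioned_objects)


def tokens_per_second(num_generated_tokens, elapsed_seconds):
    """Decoding throughput: generated tokens divided by wall-clock time."""
    return num_generated_tokens / elapsed_seconds


# Toy inputs (hypothetical values, for illustration only).
scores = [0.40, 0.05, 0.30, 0.02, 0.23]
mask = top_k_token_mask(scores, k=3)

chair_i = chair_instance(
    mentioned_objects=["dog", "frisbee", "bench", "car"],
    ground_truth_objects=["dog", "frisbee", "bench"],
)
tps = tokens_per_second(num_generated_tokens=64, elapsed_seconds=2.0)
print(mask, chair_i, tps)
```

Note that Top-K masking of this kind ignores whether a token carries visual information, which is the visual-agnostic behavior that Figure 2 associates with worse hallucination.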

Theorems & Definitions (2)

  • Definition 1
  • Theorem 1