HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding

Fan Yuan, Chi Qin, Xiaogang Xu, Piji Li

TL;DR

This work proposes Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD), a framework that effectively mitigates hallucination for different LVLMs and concurrently improves their text generation quality.

Abstract

Large Vision-Language Models (LVLMs) have shown remarkable performance on many visual-language tasks. However, these models still suffer from multimodal hallucination, that is, the generation of objects or content that conflicts with the image. Many existing works detect hallucination by directly judging whether an object exists in an image, overlooking the association between the object and semantics. To address this issue, we propose Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding (HELPD). This framework incorporates hallucination feedback at both the object level and the sentence-semantic level. Remarkably, even with a marginal degree of training, this approach can alleviate over 15% of hallucination. Simultaneously, HELPD penalizes the output logits according to the image attention window to avoid being overly affected by generated text. HELPD can be seamlessly integrated with any LVLM. Our experiments demonstrate that the proposed framework yields favorable results across multiple hallucination benchmarks. It effectively mitigates hallucination for different LVLMs and concurrently improves their text generation quality.

Paper Structure (30 sections, 14 equations, 10 figures, 12 tables)

Figures (10)

  • Figure 1: A case of LVLM hallucination. The parts marked in red are, in fact, hallucinations. The parts marked in blue would be mistaken for hallucinations by detection methods that focus only on objects.
  • Figure 2: Attention visualization of LVLMs. For the same input, each image represents the attention matrix of a specific LVLM generation instance. Red indicates the attention of the image, while green represents the phenomenon of "Over-trust" in the generated text.
  • Figure 3: This diagram illustrates the framework of HELPD. The Hierarchical Feedback Learning detects hallucination by obtaining object-level feedback from comparing object sets extracted from sampled and label sentences, and sentence-level feedback through semantic comparison using GPT-4's few-shot inference capabilities. To improve the effectiveness of sampling, the Vision Penalty Decoding augments the over-trust penalty score with a vision-enhanced penalty score, making the final logits closer to the image.
  • Figure 4: The illustration of Vision-enhanced Penalty Decoding. The total penalty is composed of the vision penalty and the over-trust penalty. The over-trust penalty is computed based on the generated text (the upper region), while the vision penalty is computed from the vision attention window (the lower area).
  • Figure 5: Detailed performance of LVLMs on the eight categories in MMHal-Bench, where "Overall" indicates the averaged performance across all categories. "w/ ours" means the application of HELPD.
  • ...and 5 more figures
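
The Figure 3 caption describes object-level feedback as a comparison between the object sets extracted from a sampled sentence and from the label sentence. The sketch below illustrates that comparison only. It is an assumption, not the paper's scoring function: the names `object_level_feedback` and `hallucinated_objects`, the F1-style overlap score, and the pre-extracted object lists are all hypothetical, and the object extractor itself is left out.

```python
from typing import Iterable, Set


def _normalize(objects: Iterable[str]) -> Set[str]:
    """Lowercase and strip object names so the comparison is case-insensitive."""
    return {obj.strip().lower() for obj in objects if obj.strip()}


def hallucinated_objects(sampled: Iterable[str], label: Iterable[str]) -> Set[str]:
    """Objects mentioned in the sampled sentence but absent from the label sentence."""
    return _normalize(sampled) - _normalize(label)


def object_level_feedback(sampled: Iterable[str], label: Iterable[str]) -> float:
    """Hypothetical object-level score: F1 overlap between the two object sets.

    Returns a value in [0, 1]; 1.0 means the sets match exactly.
    This scoring choice is an assumption made for illustration.
    """
    sampled_set = _normalize(sampled)
    label_set = _normalize(label)
    overlap = len(sampled_set & label_set)
    if overlap == 0:
        # Covers empty inputs and fully disjoint sets.
        return 0.0
    precision = overlap / len(sampled_set)
    recall = overlap / len(label_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Assumed, pre-extracted object lists for one sampled/label sentence pair.
    sampled_objs = ["dog", "frisbee", "car"]
    label_objs = ["dog", "frisbee", "grass"]
    print(hallucinated_objects(sampled_objs, label_objs))
    print(object_level_feedback(sampled_objs, label_objs))
```

A set-based score like this only sees which objects are named. As the Figure 1 caption points out, an object-only check can flag content that is not actually hallucinated, which is the gap the sentence-level feedback is meant to cover.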