ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models

Junzhe Chen, Tianshu Zhang, Shiyu Huang, Yuwei Niu, Linfeng Zhang, Lijie Wen, Xuming Hu

TL;DR

ICT is a lightweight, training-free method that computes an intervention direction to shift the model’s focus toward different levels of visual information, strengthening its attention to both high-level image content and fine-grained visual details. This reduces the model's over-reliance on language priors and thereby alleviates hallucinations.

Abstract

Despite the recent breakthroughs achieved by Large Vision-Language Models (LVLMs) in understanding and responding to complex visual-textual contexts, their inherent hallucination tendencies limit their practical application in real-world scenarios that demand high levels of precision. Existing methods typically either fine-tune the LVLMs using additional data, which incurs extra costs in manual annotation and computational resources, or perform comparisons at the decoding stage, which may eliminate useful language priors for reasoning while introducing inference-time overhead. Therefore, we propose ICT, a lightweight, training-free method that calculates an intervention direction to shift the model's focus towards different levels of visual information, enhancing its attention to high-level and fine-grained visual details. During the forward pass, the intervention is applied to the attention heads that encode the overall image information and the fine-grained object details, effectively mitigating over-reliance on language priors and thereby alleviating hallucinations. Extensive experiments demonstrate that ICT achieves strong performance with a small amount of data and generalizes well across different datasets and models. Our code will be made public.
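
To make the idea of an "intervention direction" concrete, the following is a minimal sketch and not the authors' implementation. It assumes the direction is the mean difference between a head's activations on trusted and untrusted inputs, and that the shift is added to the head output scaled by a strength `alpha`. The function names, the mean-difference rule, and the synthetic activations are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np

def intervention_direction(trusted, untrusted):
    """Assumed rule: mean trusted activation minus mean untrusted activation.

    Both inputs have shape (num_samples, head_dim).
    """
    return trusted.mean(axis=0) - untrusted.mean(axis=0)

def apply_shift(head_output, direction, alpha=1.0):
    """Add the scaled direction to one attention head's output vector."""
    return head_output + alpha * direction

# Synthetic stand-ins for head activations (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
trusted = rng.normal(loc=1.0, size=(32, 8))
untrusted = rng.normal(loc=0.0, size=(32, 8))

direction = intervention_direction(trusted, untrusted)
shifted = apply_shift(untrusted[0], direction, alpha=2.0)
```

In this sketch the shifted activation moves along the computed direction by exactly `alpha` times its length, so `alpha` acts as the single knob for intervention strength. No model weights are updated, which matches the training-free framing.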

Paper Structure

This paper contains 23 sections, 9 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Comparison between Contrastive Decoding (top) and our proposed ICT (bottom). In the top example, Contrastive Decoding indiscriminately removes both beneficial and detrimental language priors, leading to hallucinations. In contrast, our approach enhances the model's attention to visual details while preserving useful language priors, allowing it to correctly identify and describe objects in the image.
  • Figure 2: Overview of our proposed ICT method. ICT consists of two levels of intervention: Image-Level and Object-Level. The Image-Level module enhances the model’s focus on the overall scene to reduce reliance on language priors, while the Object-Level module focuses on fine-grained details to mitigate hallucinations. We apply targeted activation shifts to selected attention heads identified through binary classifiers trained to distinguish trusted and untrusted data pairs.
  • Figure 3: Comparison of ICT with baseline methods (Vanilla and VCD) on the MME benchmark. The radar chart illustrates improvements across various evaluation categories, including existence, position, count, color, and commonsense QA (CSQA).
  • Figure 4: t-SNE visualization of Object-Level and Image-Level offset vectors for LLaVA-v1.5 and Qwen-VL at layers 16 and 18.
  • Figure 5: Case Study and Error Analysis of ICT.
  • ...and 3 more figures
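
The Figure 2 caption mentions that attention heads are selected through binary classifiers trained to distinguish trusted from untrusted data. One way this selection could look is sketched below, under stated assumptions: each head gets its own logistic-regression probe, heads are ranked by held-out probe accuracy, and the top `k` are kept. The probe type, the train/validation split, the ranking rule, and all names (`probe_accuracy`, `select_heads`) are hypothetical and are not taken from the paper.

```python
import numpy as np

def probe_accuracy(x_train, y_train, x_val, y_val, steps=200, lr=0.5):
    """Fit a logistic-regression probe by gradient descent; return validation accuracy."""
    w = np.zeros(x_train.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(x_train @ w + b, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-logits))
        grad = p - y_train  # gradient of the log loss with respect to the logits
        w -= lr * (x_train.T @ grad) / len(y_train)
        b -= lr * grad.mean()
    pred = (x_val @ w + b) > 0
    return float((pred == y_val.astype(bool)).mean())

def select_heads(acts, labels, k, split=0.75):
    """Rank heads by probe accuracy and return the indices of the top k.

    acts: (num_samples, num_heads, head_dim); labels: 1 = trusted, 0 = untrusted.
    """
    n_train = int(len(labels) * split)
    accs = [
        probe_accuracy(acts[:n_train, h], labels[:n_train],
                       acts[n_train:, h], labels[n_train:])
        for h in range(acts.shape[1])
    ]
    order = np.argsort(accs)[::-1]  # highest accuracy first
    return order[:k].tolist(), accs

# Synthetic activations: only hypothetical head 2 carries the trusted/untrusted signal.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200).astype(float)
acts = rng.normal(size=(200, 4, 8))
acts[:, 2, :] += 3.0 * labels[:, None]

top_heads, accs = select_heads(acts, labels, k=1)
```

On this synthetic data the probe for head 2 separates the two classes almost perfectly while the other heads stay near chance, so head 2 is the one selected. In a real model the activations would be collected from the LVLM's attention heads rather than sampled.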