The Role of Background Information in Reducing Object Hallucination in Vision-Language Models: Insights from Cutoff API Prompting
Masayo Tomita, Katsuhiko Hayashi, Tomoyuki Kaneko
TL;DR
This study investigates object hallucination in Vision-Language Models and analyzes attention-guided API Prompting as a mitigation strategy. It uses prompting based on attribution maps from CLIP and LLaVA, examines the role of background context via API-Seg heatmaps, and applies a Minimum Cutoff to the heatmaps to refine prompting. The key finding is that preserving background information is crucial for accurate object identification: API Prompting with Cutoff yields recall gains of approximately $3\%$ and improves the alignment between attention and target objects, and small objects are particularly sensitive to background context. The results suggest that prompting should enhance the visibility of target regions rather than mask other content. Future work should extend the analysis to models and datasets beyond MSCOCO to assess generalizability and robustness.
Abstract
Vision-Language Models (VLMs) occasionally generate outputs that contradict their input images, which limits their reliability in real-world applications. Visual prompting is reported to suppress hallucinations by augmenting prompts with the relevant areas of an image, but how its effectiveness depends on which areas are emphasized remains uncertain. This study analyzes success and failure cases of attention-driven visual prompting for object hallucination and reveals that preserving background context is crucial for mitigating it.
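To make the idea of a Minimum Cutoff on a heatmap concrete, the following is a minimal sketch and not the authors' implementation. It assumes the heatmap is min-max normalized to $[0, 1]$ and that the cutoff acts as a floor, so background pixels are dimmed but never fully masked. The function names `apply_min_cutoff` and `overlay`, the default value `min_cutoff=0.3`, and the multiplicative overlay on a grayscale image are all hypothetical choices made for illustration.

```python
def apply_min_cutoff(heatmap, min_cutoff=0.3):
    """Normalize a 2D heatmap to [0, 1] and raise low values to a floor.

    Assumption: the cutoff is a lower bound on the mask value, so the
    background keeps at least `min_cutoff` of its original brightness.
    """
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    mask = []
    for row in heatmap:
        if span == 0:
            # A constant heatmap carries no preference; keep everything visible.
            norm_row = [1.0] * len(row)
        else:
            norm_row = [(v - lo) / span for v in row]
        mask.append([max(v, min_cutoff) for v in norm_row])
    return mask


def overlay(image, mask):
    """Scale each grayscale pixel (0-255) by the matching mask value."""
    return [
        [int(round(pixel * weight)) for pixel, weight in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


if __name__ == "__main__":
    heatmap = [[0.0, 1.0], [0.5, 0.2]]
    image = [[200, 200], [200, 200]]
    mask = apply_min_cutoff(heatmap, min_cutoff=0.3)
    print(overlay(image, mask))
```

Under these assumptions, a cutoff of 0 reduces to plain heatmap masking, where low-attention regions go black, while a larger cutoff keeps progressively more of the background visible.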
