The Role of Background Information in Reducing Object Hallucination in Vision-Language Models: Insights from Cutoff API Prompting

Masayo Tomita, Katsuhiko Hayashi, Tomoyuki Kaneko

TL;DR

The study investigates object hallucination in Vision-Language Models and analyzes attention-guided API Prompting as a mitigation strategy. It introduces attribution-map–based prompting from CLIP and LLaVA, investigates the role of background context via API-Seg heatmaps, and applies a Minimum Cutoff to heatmaps to refine prompting. Key findings show that preserving background information is crucial for accurate object identification, with API Prompting plus Cutoff yielding notable recall gains (approx. $3\%$) and improved alignment between attention and target objects; small objects are particularly sensitive to background context. The results guide prompting design toward enhancing visibility of target regions rather than masking content, and point to future work expanding models and datasets beyond MSCOCO to assess generalizability and robustness.
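As a rough illustration of why a minimum cutoff preserves background, the sketch below clamps a normalized heatmap to a floor value before using it to weight the image, so low-attention regions are dimmed rather than blacked out. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names `apply_min_cutoff` and `overlay_heatmap`, the default floor of 0.3, and the multiplicative overlay are all hypothetical choices made for illustration.

```python
import numpy as np


def apply_min_cutoff(heatmap, min_cutoff=0.3):
    """Normalize a heatmap to [0, 1], then raise values below min_cutoff up to it.

    Assumption: the cutoff acts as a floor on attribution weights. The value
    0.3 is illustrative and not taken from the paper.
    """
    h = np.asarray(heatmap, dtype=np.float64)
    span = h.max() - h.min()
    if span > 0:
        h = (h - h.min()) / span
    else:
        # A constant heatmap carries no preference, so keep the image unchanged.
        h = np.ones_like(h)
    return np.clip(h, min_cutoff, 1.0)


def overlay_heatmap(image, heatmap, min_cutoff=0.3):
    """Weight an HxWx3 uint8 image by the floored HxW heatmap (hypothetical overlay)."""
    weights = apply_min_cutoff(heatmap, min_cutoff)
    weighted = image.astype(np.float64) * weights[..., None]
    return weighted.round().astype(np.uint8)


# Toy example: a uniform gray image and a 2x2 heatmap.
image = np.full((2, 2, 3), 200, dtype=np.uint8)
heatmap = np.array([[0.0, 0.5], [1.0, 0.1]])

with_cutoff = overlay_heatmap(image, heatmap, min_cutoff=0.3)
without_cutoff = overlay_heatmap(image, heatmap, min_cutoff=0.0)
```

With the floor in place, the pixel whose heatmap value is 0 keeps 30% of its intensity; with a floor of 0 it is masked to black, which is the loss of background context that the findings above warn against.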

Abstract

Vision-Language Models (VLMs) occasionally generate outputs that contradict input images, which limits their reliability in real-world applications. Visual prompting is reported to suppress hallucinations by augmenting prompts with the relevant area of an image, but how its effectiveness depends on the area used remains uncertain. This study analyzes success and failure cases of attention-driven visual prompting for object hallucination and finds that preserving background context is crucial for mitigating it.

Paper Structure

This paper contains 24 sections, 4 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: API Prompting Process: Without Cutoff (API) vs. With Cutoff (Proposed).
  • Figure 2: Process of Visual Attention Evaluation.
  • Figure 3: Image Size vs. POPE Results.
  • Figure 4: Examples of Cutoff Segmentation.
  • Figure 5: Examples of Successful Cases of Objects Present in Images.
  • ...and 5 more figures