Revealing Multi-View Hallucination in Large Vision-Language Models

Wooje Park, Insu Lee, Soohyun Kim, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

Abstract

Large vision-language models (LVLMs) are increasingly applied to multi-view image inputs captured from diverse viewpoints. Despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To analyze this problem systematically, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs that target two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision demonstrate that RSCD consistently improves performance, by up to 21.1 and 34.6 points over existing hallucination mitigation methods.
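
The abstract describes RSCD only at a high level: a second set of "negative" logits is produced under attention masking and contrasted with the regular logits. The following is a minimal sketch of the generic contrastive-decoding combination step, not the authors' implementation. The formula `(1 + alpha) * pos - alpha * neg`, the weight `alpha`, and the names `contrastive_logits` and `greedy_token` are assumptions borrowed from common contrastive-decoding practice. How RSCD chooses which attention entries to mask is not stated in the abstract and is not modelled here.

```python
def contrastive_logits(pos_logits, neg_logits, alpha=1.0):
    """Combine two logit vectors over the same vocabulary.

    pos_logits: logits from the regular forward pass.
    neg_logits: logits from a second pass run under attention masking
                (ASSUMPTION: the masking itself happens elsewhere).
    alpha:      hypothetical contrast strength; alpha = 0 returns pos_logits.
    """
    if len(pos_logits) != len(neg_logits):
        raise ValueError("both passes must score the same vocabulary")
    # Tokens that stay highly scored in the negative pass are pushed down;
    # tokens that lose support in the negative pass are pushed up.
    return [(1.0 + alpha) * p - alpha * n
            for p, n in zip(pos_logits, neg_logits)]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy three-token vocabulary with made-up scores.
pos = [2.0, 1.9, 0.1]   # regular pass slightly prefers token 0
neg = [2.0, 0.5, 0.1]   # negative pass still favours token 0, not token 1

plain_choice = greedy_token(pos)
contrast_choice = greedy_token(contrastive_logits(pos, neg, alpha=1.0))
```

In this toy case the regular pass picks token 0, while the contrasted scores favour token 1, whose support drops sharply in the negative pass. In practice the combined logits would replace the regular logits at each decoding step before sampling or greedy selection.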

Paper Structure (41 sections, 12 equations, 18 figures, 6 tables)

Figures (18)

  • Figure 1: Illustration of two types of multi-view hallucination in LVLMs categorized by the source of the interference. Cross-Instance Hallucination: The model relies on information from another instance. Cross-View Hallucination: The model relies on information from another view.
  • Figure 2: Overview of the MVH-Bench construction pipeline: (a) instance-descriptor pair extraction, and (b) automated question–answer generation followed by human verification.
  • Figure 3: (a) Comparison between conventional and multi-view hallucination by their underlying causes. (b) Overview of the proposed RSCD, illustrating its core idea and underlying intuition.
  • Figure 4: Layer-wise analysis of text-to-text attention blocking for LLaVA-OneVision-7B and Qwen2.5-VL on a captioning task. The yellow shaded region indicates the layer range selected by RSCD.
  • Figure 5: Analysis of RSCD across different layer ranges and hyperparameter settings.
  • ...and 13 more figures