Revealing Multi-View Hallucination in Large Vision-Language Models

Wooje Park; Insu Lee; Soohyun Kim; Jaeyun Jang; Minyoung Noh; Kyuhong Shim; Byonghyo Shim

Abstract

Large vision-language models (LVLMs) are increasingly applied to multi-view image inputs captured from diverse viewpoints. Despite this growing use, current LVLMs often confuse or mismatch visual information originating from different instances or viewpoints, a phenomenon we term multi-view hallucination. To systematically analyze this problem, we construct MVH-Bench, a benchmark comprising 4.8k question-answer pairs targeting two types of hallucination: cross-instance and cross-view. Empirical results show that recent LVLMs struggle to correctly associate visual evidence with its corresponding instance or viewpoint. To overcome this limitation, we propose Reference Shift Contrastive Decoding (RSCD), a training-free decoding technique that suppresses visual interference by generating negative logits through attention masking. Experiments on MVH-Bench with Qwen2.5-VL and LLaVA-OneVision show that RSCD consistently improves performance, by up to 21.1 and 34.6 points respectively, over existing hallucination mitigation methods.
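
The abstract does not give the RSCD formula, so the following is only a minimal sketch of the generic contrastive decoding step that methods of this kind build on. It assumes the widely used combination (1 + alpha) * reference - alpha * negative; the function names, the `alpha` parameter, and the toy logit values are all hypothetical, and the paper's actual formulation and attention-masking scheme may differ.

```python
import math


def contrastive_logits(reference_logits, negative_logits, alpha=1.0):
    """Combine two next-token logit vectors (generic contrastive decoding).

    reference_logits: logits from the ordinary forward pass.
    negative_logits:  logits from a second pass run under some attention mask
                      (the masking itself is NOT modeled here; assumption).
    alpha:            hypothetical contrast strength; alpha=0 disables contrast.
    """
    if len(reference_logits) != len(negative_logits):
        raise ValueError("logit vectors must have the same vocabulary size")
    return [
        (1.0 + alpha) * ref - alpha * neg
        for ref, neg in zip(reference_logits, negative_logits)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy three-token vocabulary with made-up numbers.
# Token 1 scores equally in both passes, so the contrast leaves it unchanged
# while token 0, which drops in the negative pass, is boosted.
reference = [2.0, 1.9, 0.1]
negative = [0.5, 1.9, 0.1]

adjusted = contrastive_logits(reference, negative, alpha=1.0)
probs = softmax(adjusted)
best_token = max(range(len(probs)), key=probs.__getitem__)
```

In this toy case the two leading tokens are nearly tied in the reference pass, and the contrast widens the gap in favor of the token whose score depends on the unmasked pass. Because the combination acts only on output logits, it needs no retraining, which matches the training-free property claimed in the abstract.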

Paper Structure (41 sections, 12 equations, 18 figures, 6 tables)

Introduction
Multi-View Hallucination Benchmark
Benchmark Design
Instance-Descriptor Pairs Extraction
Automated Question-Answer Generation
Cross-Instance Hallucination
Cross-View Hallucination
Human Verification
MVH-Bench Evaluation
Binary Question
Multiple-Choice Question
Reference Shift Contrastive Decoding
MVH Problem Statement
Analysis of Query Context Formation
Contrastive Decoding via Reference Shift
...and 26 more sections

Figures (18)

Figure 1: Illustration of two types of multi-view hallucination in LVLMs categorized by the source of the interference. Cross-Instance Hallucination: The model relies on information from another instance. Cross-View Hallucination: The model relies on information from another view.
Figure 2: Overview of the MVH-Bench construction pipeline: (a) instance-descriptor pair extraction, and (b) automated question–answer generation followed by human verification.
Figure 3: (a) Comparison between conventional and multi-view hallucination by their underlying causes. (b) Overview of the proposed RSCD, illustrating its core idea and underlying intuition.
Figure 4: Layer-wise analysis of text-to-text attention blocking for LLaVA-OneVision-7B and Qwen2.5-VL on a captioning task. The yellow shaded region indicates the layer range selected by RSCD.
Figure 5: Analysis of RSCD across different layer ranges and hyperparameter settings.
...and 13 more figures
