Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models
Mingrui Wu, Jiayi Ji, Oucheng Huang, Jiale Li, Yuhang Wu, Xiaoshuai Sun, Rongrong Ji
TL;DR
The paper tackles the under-explored problem of relationship hallucinations in Large Vision-Language Models (LVLMs) by introducing R-Bench, a mixed-answer benchmark derived from nocaps/COCO data that pairs image-level Yes/No questions with instance-level queries grounded by bounding boxes or masks. The authors show that relationship hallucinations are more severe than object hallucinations, driven by long-tail distributions and three co-occurrence modes (relationship–relationship, subject–relationship, relationship–object), and they find that LVLMs often rely on common-sense priors rather than the actual visual content, struggling in particular to reason about spatial relations in context. Through evaluation of several popular LVLMs and analysis of counterfactual and illusion relationship hallucinations, the work highlights the need for finer-grained image–text alignment and stronger relational reasoning. It contributes a practical benchmark, detailed data statistics, and insights that can guide future mitigation strategies for relational understanding in LVLMs, with broad implications for reliability in real-world visual reasoning tasks.
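To make the two question granularities concrete, below is a minimal sketch of what R-Bench-style records might look like. The field names and coordinate convention are hypothetical illustrations, not the benchmark's actual schema.

```python
# Hypothetical R-Bench-style records (field names are illustrative only).
image_level_q = {
    "image": "example_0001.jpg",
    # Image-level question: does this relationship exist anywhere in the image?
    "question": "Is the person riding the horse?",
    "answer": "no",  # mixed Yes/No ground truth discourages a yes-bias
}

instance_level_q = {
    "image": "example_0002.jpg",
    # Instance-level question: grounded to one object via a box (or mask).
    "bbox": [120, 45, 310, 260],  # assumed [x1, y1, x2, y2] pixel coords
    "question": "Is the object in the box on top of the table?",
    "answer": "yes",
}
```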
Abstract
The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, even though understanding such relationships is essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The long-tail distribution of the visual instruction tuning dataset significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common-sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information.
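As a concrete illustration of the Yes/No evaluation protocol described above, the sketch below scores a model's answers and also reports the fraction of "yes" predictions, a common check for answer bias in hallucination benchmarks. The record format and the `ask_model` callable are assumptions for illustration, not R-Bench's actual interface.

```python
def normalize(ans: str) -> str:
    """Map a free-form model reply to 'yes'/'no' (assumed convention)."""
    return "yes" if ans.strip().lower().startswith("yes") else "no"

def evaluate(records, ask_model):
    """Score Yes/No answers.

    records: list of dicts with 'image', 'question', 'answer' keys (assumed).
    ask_model(image_path, question) -> model's textual reply.
    Returns accuracy plus the 'yes' ratio as a bias diagnostic.
    """
    correct, yes_count = 0, 0
    for r in records:
        pred = normalize(ask_model(r["image"], r["question"]))
        yes_count += pred == "yes"
        correct += pred == r["answer"]
    n = len(records)
    return {"accuracy": correct / n, "yes_ratio": yes_count / n}

# Usage (hypothetical file and model wrapper):
# import json
# records = json.load(open("rbench_image_level.json"))
# print(evaluate(records, my_lvlm.ask))
```

A balanced benchmark should yield a yes-ratio near 0.5 from an unbiased model; a much higher value suggests the model is answering from priors rather than the image.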
