Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

Mingrui Wu, Jiayi Ji, Oucheng Huang, Jiale Li, Yuhang Wu, Xiaoshuai Sun, Rongrong Ji

TL;DR

The paper tackles the under-explored problem of relationship hallucinations in Large Vision-Language Models by introducing R-Bench, a mixed-answer-type benchmark consisting of image-level Yes/No questions and instance-level, bounding-box- or mask-attached queries derived from nocaps/COCO data. The authors demonstrate that relationship hallucinations are more severe than object hallucinations due to long-tail distributions and three co-occurrence modes (relationship–relationship, subject–relationship, relationship–object), and they reveal that LVLMs often rely on common-sense priors rather than actual visual content, with significant difficulty reasoning about spatial relations in context. Through evaluation of several popular LVLMs and analysis of counterfactual and illusion relationship hallucinations, the work highlights the need for finer image–text alignment and improved relational reasoning. It also provides a practical benchmark, detailed data statistics, and insights that can guide future mitigation strategies for relational understanding in LVLMs, with broad implications for reliability in real-world visual reasoning tasks.

Abstract

The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which are essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset's long-tail distribution significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common-sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information.

Paper Structure

This paper contains 20 sections, 14 figures, 9 tables.

Figures (14)

  • Figure 1: The object hallucination and relationship hallucination in Large Vision-Language Models. While substantial research has addressed object hallucinations in LVLMs, the issue of relationship hallucinations remains under-explored.
  • Figure 2: Our pipeline generates image-level and instance-level questions. First, we parse all COCO captions into a relationship set. Given a nocaps image, we parse its corresponding captions into relationship triplets and match these with the relationship set to obtain a set of relationship seeds. Using GroundingDINO, we identify significant objects with bounding boxes. We then create two types of prompts based on the nocaps captions, relationship seeds, and bounding boxes. Finally, we feed these prompts into an LLM to generate image-level and instance-level questions. Additionally, we carefully filter out noisy questions to create the refined R-Bench.
  • Figure 3: The co-occurrence matrices constructed between relationship-relationship (left), subject-relationship (middle), and relationship-object (right) respectively. The matrices show the conditional probability that an element on the y-axis occurs given that an element on the x-axis is present.
  • Figure 4: Examples of relationship hallucinations that arise for different reasons. Wrong answers are marked in red, the relationships in the answers are underlined, and correct answers are marked in green.
  • Figure 5: The probability of relationship hallucination when "man swings bat" occurs. The co-occurrence frequencies of these relationships with "man swings bat" decrease from left to right.
  • ...and 9 more figures
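The conditional co-occurrence statistics shown in Figure 3 can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' code: the per-image relationship triplets are hypothetical placeholders, whereas R-Bench parses them from COCO/nocaps captions. It computes, for each pair of predicates, the probability that one appears in an image given that the other does (the relationship–relationship case; the subject–relationship and relationship–object matrices follow the same pattern with different element sets).

```python
from collections import Counter

# Hypothetical per-image (subject, predicate, object) triplets;
# in the paper these come from parsed COCO/nocaps captions.
images = [
    [("man", "swings", "bat"), ("man", "wears", "helmet")],
    [("man", "swings", "bat"), ("crowd", "watches", "man")],
    [("dog", "chases", "ball")],
]

def cooccurrence(images):
    """P(y | x): probability that predicate y appears in an image,
    given that predicate x appears in the same image."""
    single = Counter()  # number of images containing predicate x
    pair = Counter()    # number of images containing both x and y
    for rels in images:
        preds = {p for _, p, _ in rels}
        single.update(preds)
        for x in preds:
            for y in preds:
                if x != y:
                    pair[(x, y)] += 1
    return {(x, y): n / single[x] for (x, y), n in pair.items()}

probs = cooccurrence(images)
# "swings" appears in two images; "wears" co-occurs in one of them.
print(probs[("swings", "wears")])  # -> 0.5
```

Such a matrix makes the hallucination mechanism concrete: predicates with high conditional co-occurrence (e.g. Figure 5's neighbors of "man swings bat") are exactly the ones an LVLM is tempted to report from priors rather than from the image.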