Understanding and Evaluating Hallucinations in 3D Visual Language Models

Ruiying Peng, Kaiyuan Li, Weichen Zhang, Chen Gao, Xinlei Chen, Yong Li

TL;DR

Understanding and Evaluating Hallucinations in 3D Visual Language Models investigates hallucinations in 3D-LLMs and identifies data bias as a major driver. It defines 3D hallucination types, introduces two evaluation strategies (Random Point Cloud Pair Evaluation and Opposite Question Evaluation), and presents a detection benchmark with metrics including $HR_{random}$ and $HR_{opposite}$. Experiments on 3DLLM and LL3DA show pervasive object and spatial-relation hallucinations, driven by imbalanced object frequencies and strong object correlations in the datasets. The work highlights the gap between visual grounding and textual priors in current 3D-LLMs and offers data-centric steps toward more faithful spatial reasoning in embodied scene understanding.

Abstract

Recently, 3D-LLMs, which combine point-cloud encoders with large models, have been proposed to tackle complex tasks in embodied intelligence and scene understanding. Although they show promising results on 3D tasks, we find that they are significantly affected by hallucinations. For instance, they may generate objects that do not exist in the scene or produce incorrect relationships between objects. To investigate this issue, this work presents the first systematic study of hallucinations in 3D-LLMs. We begin with a quick evaluation of several representative 3D-LLMs and show that all of them hallucinate to a significant degree. We then define hallucinations in 3D scenes and, through a detailed analysis of datasets, uncover their underlying causes. We find three main causes: (1) an uneven frequency distribution of objects in the dataset, (2) strong correlations between objects, and (3) limited diversity in object attributes. Additionally, we propose new evaluation metrics for hallucinations, including Random Point Cloud Pair and Opposite Question Evaluations, to assess whether the model bases its responses on visual information and aligns that information with the meaning of the text.

Paper Structure

This paper contains 22 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: In 3D scenes, the relationships between objects are significantly more complex than those in text or images. The left side of the figure illustrates hallucinations related to relative positional relationships and absolute positional relationships, while the right side demonstrates attribute hallucinations such as color, size, and shape.
  • Figure 2: Object hallucination evaluation for 3D LLMs. Precision measures the proportion of described objects that exist in the scene, while recall represents the proportion of scene objects that are described.
  • Figure 3: (1) Panels a and b show the relationship between object hallucination rates in 3DLLM and LL3DA and object occurrence frequencies in the dataset. (2) Panel c shows the relationship between strong object correlations and object hallucination rates.
  • Figure 4: In the evaluation process, we generate new QA pairs in two ways. First, we change the scene while keeping the questions fixed: different scenes are randomly selected to form new QA pairs. Second, we modify the questions while keeping the scene fixed: spatial relationship-related questions are selected, and all QA pairs are transformed such that object A is the focus. The spatial relationship in each question is then inverted, generating new QA pairs.
  • Figure 5: Impact of Attribute Simplicity on Accuracy. ROUGE represents the average quality of question-answer pairs for a specific item, while the Top 3 Ratio is the proportion of the three most common attributes of the item.
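
The quantities named in the captions for Figures 2, 4, and 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the `OPPOSITE` relation table, and the convention of returning 0.0 for empty inputs are all hypothetical. The Top 3 Ratio is read here as the share of an item's attribute mentions covered by its three most frequent attribute values, which is one plausible interpretation of the caption.

```python
from collections import Counter


def object_precision_recall(described, in_scene):
    """Figure 2: precision is the share of described objects that exist in
    the scene; recall is the share of scene objects that are described."""
    described, in_scene = set(described), set(in_scene)
    hits = described & in_scene
    # Assumption: an empty set yields 0.0 rather than raising an error.
    precision = len(hits) / len(described) if described else 0.0
    recall = len(hits) / len(in_scene) if in_scene else 0.0
    return precision, recall


def top3_ratio(attribute_values):
    """Figure 5 (assumed reading): fraction of an item's attribute mentions
    accounted for by its three most common attribute values."""
    counts = Counter(attribute_values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(3)) / total


# Hypothetical inversion table for the opposite-question step in Figure 4.
OPPOSITE = {
    "left of": "right of",
    "right of": "left of",
    "above": "below",
    "below": "above",
    "in front of": "behind",
    "behind": "in front of",
}


def invert_relation(question):
    """Replace the first known spatial relation with its opposite.
    Returns None when the question contains no relation from the table."""
    for relation, opposite in OPPOSITE.items():
        if relation in question:
            return question.replace(relation, opposite, 1)
    return None


precision, recall = object_precision_recall(
    described=["chair", "table", "lamp"],
    in_scene=["chair", "table", "sofa", "bed"],
)
ratio = top3_ratio(["brown", "brown", "white", "black", "red"])
flipped = invert_relation("Is the chair left of the table?")
```

In this toy example the model describes a lamp that is not in the scene, which lowers precision, and omits the sofa and bed, which lowers recall. A high Top 3 Ratio indicates that an item's attributes are concentrated in a few values, which is the kind of limited attribute diversity the abstract lists as a cause of hallucination.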