Understanding Multimodal Hallucination with Parameter-Free Representation Alignment

Yueqian Wang; Jianxin Liang; Yuxuan Wang; Huishuai Zhang; Dongyan Zhao

TL;DR

This paper investigates which components of MLLMs contribute to object hallucinations, and proposes a parameter-free representation alignment metric (Pfram) that measures the similarity between any two representation systems without requiring additional trained parameters.

Abstract

Hallucination is a common issue in Multimodal Large Language Models (MLLMs), yet the underlying principles remain poorly understood. In this paper, we investigate which components of MLLMs contribute to object hallucinations. To analyze image representations while excluding the influence of every factor other than the image representation itself, we propose a parameter-free representation alignment metric (Pfram) that measures the similarity between any two representation systems without requiring additional trained parameters. Notably, Pfram can also assess the alignment of a neural representation system with the human representation system, represented by ground-truth annotations of images. By evaluating the alignment with object annotations, we demonstrate that this metric shows strong and consistent correlations with object hallucination across a wide range of state-of-the-art MLLMs, spanning various model architectures and sizes. Furthermore, using this metric, we explore other key issues related to image representations in MLLMs, such as the role of different modules, the impact of textual instructions, and potential improvements including the use of alternative visual encoders. Our code is available at: https://github.com/yellow-binary-tree/Pfram.
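To make the idea of a training-free alignment score concrete, the sketch below shows one way such a score could be computed; it is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). It assumes image representations are compared by cosine similarity, that the annotation-based system scores two images by the number of objects they share, and that agreement is measured by NDCG over each anchor image's nearest neighbors. The function names `cosine_similarity_matrix` and `ndcg_alignment`, the cutoff `k`, and the toy data are all hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x (one row per image)."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return x @ x.T


def ndcg_alignment(sim_f, sim_g, k=5):
    """Mean NDCG@k of the neighbor ranking induced by sim_f, graded by sim_g.

    sim_f, sim_g: (n, n) similarity matrices over the same n images.
    Assumption: sim_g is non-negative (e.g. counts of shared objects).
    """
    n = sim_f.shape[0]
    k = min(k, n - 1)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # ranks 1..k
    scores = []
    for i in range(n):
        others = np.delete(np.arange(n), i)  # exclude the anchor itself
        # Top-k neighbors of anchor i according to system F.
        order_f = others[np.argsort(-sim_f[i, others], kind="stable")][:k]
        dcg = float(np.sum(sim_g[i, order_f] * discounts))
        # Best achievable ordering according to system G.
        ideal = np.sort(sim_g[i, others])[::-1][:k]
        idcg = float(np.sum(ideal * discounts))
        if idcg > 0:  # skip anchors that share no objects with any image
            scores.append(dcg / idcg)
    return float(np.mean(scores)) if scores else 0.0


# Toy data (hypothetical): 4 images, 3 object classes, multi-hot annotations.
labels = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
sim_g = labels @ labels.T  # number of shared objects per image pair

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))  # stand-in for model image representations
score = ndcg_alignment(cosine_similarity_matrix(feats), sim_g, k=2)
```

No parameters are fitted anywhere in this computation: both systems are reduced to similarity rankings over the same images, so the score depends only on the representations being compared.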

Paper Structure (20 sections, 1 equation, 7 figures, 10 tables, 3 algorithms)

Introduction
Related Works
Hallucination of MLLMs
Similarity Metrics of Representations
Introduction of The Pfram Metric
The Correlation between Pfram and Object Hallucination
Implementation Details
Models and datasets.
Measuring Pframs of MLLMs.
Pfram Has Strong Correlation with Object Hallucination
Using Pfram to Diagnose MLLMs
Diagnosis of Different Modules in MLLMs
Influence of Textual Instructions to Image Representations Depends on Model
Jointly using Multiple ViTs May be Helpful
Limitations
...and 5 more sections

Figures (7)

Figure 1: Terms of different parts of MLLMs used in this paper.
Figure 2: A demonstration of the Pfram metric.
Figure 3: The POPE and Pfram scores for MLLMs with different LLM sizes (distinguished by point sizes) and different projectors (distinguished by point colors).
Figure 4: The change of Pfram across the layers of the MLLM. The x-axis denotes the layers of the MLLM, where the left side (small numbers) is closer to the input and the right side (large numbers) is closer to the output. Negative x values denote the image hidden states from the ViT and QFormer/Resampler (if any), x=0 denotes the image representations given as input to the LLM, and positive x values denote the image hidden states from the LLM. The y-axis is Pfram$(\mathcal{F}; \mathcal{G}_{obj} | \phi_{\mathrm{NDCG}}, \mathrm{OIv7})$, and the standard deviation is shown as shaded areas. Best viewed in color.
Figure 5: The change of Pfram across the layers of the MLLM when the image representations are conditioned on different textual instructions. The axes are defined as in Fig. 4. In the legend, "inst_0" and "inst_1" denote that the first and the second instruction is used, respectively, and "inst_rand" denotes that the instruction is chosen at random. Best viewed in color.
...and 2 more figures
