Understanding Multimodal Hallucination with Parameter-Free Representation Alignment

Yueqian Wang, Jianxin Liang, Yuxuan Wang, Huishuai Zhang, Dongyan Zhao

TL;DR

This paper investigates which components of MLLMs contribute to object hallucinations, and proposes a parameter-free representation alignment metric (Pfram) that measures the similarity between any two representation systems without requiring additional trainable parameters.

Abstract

Hallucination is a common issue in Multimodal Large Language Models (MLLMs), yet the underlying principles remain poorly understood. In this paper, we investigate which components of MLLMs contribute to object hallucinations. To analyze image representations in isolation from every factor other than the image representation itself, we propose a parameter-free representation alignment metric (Pfram) that measures the similarity between any two representation systems without requiring additional trainable parameters. Notably, Pfram can also assess the alignment of a neural representation system with the human representation system, represented by ground-truth annotations of images. By evaluating the alignment with object annotations, we demonstrate that this metric shows strong and consistent correlations with object hallucination across a wide range of state-of-the-art MLLMs, spanning various model architectures and sizes. Furthermore, using this metric, we explore other key issues related to image representations in MLLMs, such as the role of different modules, the impact of textual instructions, and potential improvements including the use of alternative visual encoders. Our code is available at: https://github.com/yellow-binary-tree/Pfram.
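
To make the idea of a training-free alignment score concrete, the following is a minimal sketch and not the authors' implementation (the repository linked above is the authoritative source). It assumes alignment is scored by ranking: for each anchor image, the other images are ranked by similarity in a neural representation, and that ranking is scored with NDCG against a ground-truth similarity derived from object annotations. The function names (`cosine_similarity_matrix`, `jaccard_matrix`, `ndcg_alignment`), the use of Jaccard overlap as the ground-truth similarity, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(feats):
    """Pairwise cosine similarity between row vectors (hypothetical helper)."""
    feats = np.asarray(feats, dtype=float)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def jaccard_matrix(label_sets):
    """Pairwise Jaccard overlap of object-label sets (assumed ground-truth similarity)."""
    n = len(label_sets)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            union = label_sets[i] | label_sets[j]
            out[i, j] = len(label_sets[i] & label_sets[j]) / len(union) if union else 0.0
    return out


def ndcg_alignment(sim_f, sim_g, k):
    """Mean NDCG@k of the neighbor rankings induced by sim_f,
    using sim_g (non-negative) as graded relevance. No parameters are trained."""
    n = sim_f.shape[0]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    scores = []
    for i in range(n):
        others = np.delete(np.arange(n), i)  # exclude the anchor itself
        ranked = others[np.argsort(-sim_f[i, others], kind="stable")][:k]
        dcg = np.sum(sim_g[i, ranked] * discounts[: len(ranked)])
        ideal = np.sort(sim_g[i, others])[::-1][:k]
        idcg = np.sum(ideal * discounts[: len(ideal)])
        if idcg > 0:  # skip anchors with no relevant neighbors
            scores.append(dcg / idcg)
    return float(np.mean(scores)) if scores else 0.0


# Toy data: four images with object annotations (illustrative only).
labels = [{"cat", "sofa"}, {"cat"}, {"dog", "car"}, {"car"}]
sim_g = jaccard_matrix(labels)

# Features whose neighbors agree with the annotations (multi-hot over cat, sofa, dog, car).
aligned_feats = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
# Features that place images with disjoint objects close together.
misaligned_feats = [[1, 0], [0, 1], [1, 0.1], [0.1, 1]]

aligned_score = ndcg_alignment(cosine_similarity_matrix(aligned_feats), sim_g, k=2)
misaligned_score = ndcg_alignment(cosine_similarity_matrix(misaligned_feats), sim_g, k=2)
print(aligned_score, misaligned_score)
```

Under these assumptions, a representation whose nearest neighbors share annotated objects scores higher than one whose neighbors do not, which is the kind of signal the abstract relates to object hallucination.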

Paper Structure (20 sections, 1 equation, 7 figures, 10 tables, 3 algorithms)

This paper contains 20 sections, 1 equation, 7 figures, 10 tables, 3 algorithms.

Figures (7)

  • Figure 1: Terms of different parts of MLLMs used in this paper.
  • Figure 2: A demonstration of the Pfram metric.
  • Figure 3: The POPE and Pfram scores for MLLMs with different LLM sizes (distinguished by point sizes) and different projectors (distinguished by point colors).
  • Figure 4: The change of Pfram across the layers of an MLLM. The x-axis denotes the layers of the MLLM, where the left side (small numbers) is closer to the input and the right side (large numbers) is closer to the output. The negative x-axis denotes the image hidden states from the ViT and QFormer/Resampler (if any), x=0 denotes the image representations given as input to the LLM, and the positive x-axis denotes the image hidden states from the LLM. The y-axis is Pfram$(\mathcal{F}; \mathcal{G}_{obj} | \phi_{\mathrm{NDCG}}, \mathrm{OIv7})$, and the standard deviation is shown as shaded areas. Best viewed in color.
  • Figure 5: The change of Pfram across the layers of an MLLM when image representations are conditioned on different textual instructions. The axes are defined as in Fig. 4. In the legend, "inst_0" and "inst_1" denote that the first and the second instruction is used, respectively, and "inst_rand" denotes that the instruction is randomly chosen. Best viewed in color.
  • ...and 2 more figures