Exploring and Evaluating Multimodal Knowledge Reasoning Consistency of Multimodal Large Language Models

Boyu Jia, Junzhe Zhang, Huixuan Zhang, Xiaojun Wan

TL;DR

The paper investigates the inconsistent behavior of multimodal large language models (MLLMs) when integrating visual and textual knowledge. To probe consistency in multimodal reasoning systematically, it introduces four evaluation tasks, a new multi-image, multi-hop dataset derived from MQuAKE, and a Consistency Rate metric. Experiments on several state-of-the-art models, including GPT-4o, show that consistency degrades as the number of reasoning hops increases, varies across relation types, and is influenced by task design and prompting strategies. The work provides benchmarks and insights to guide future improvements in robust multimodal reasoning for real-world applications.

Abstract

In recent years, multimodal large language models (MLLMs) have achieved significant breakthroughs, enhancing understanding across text and vision. However, current MLLMs still face challenges in effectively integrating knowledge across these modalities during multimodal knowledge reasoning, leading to inconsistencies in reasoning outcomes. To systematically explore this issue, we propose four evaluation tasks and construct a new dataset. We conduct a series of experiments on this dataset to analyze and compare the extent of consistency degradation in multimodal knowledge reasoning within MLLMs. Based on the experimental results, we identify factors contributing to the observed degradation in consistency. Our research provides new insights into the challenges of multimodal knowledge reasoning and offers valuable guidance for future efforts aimed at improving MLLMs.

Paper Structure

This paper contains 29 sections, 3 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: An example of measuring the consistency of a multimodal language model in a multimodal knowledge reasoning task. (Given three pictures of Michael Jordan and one picture of basketball star Kyrie Irving, the team Michael Jordan played for the longest time was the Chicago Bulls.)
  • Figure 2: Examples of our multimodal knowledge reasoning tasks.
  • Figure 3: Inconsistency rate for different relation types across models.