Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models

Yifan Jia, Kailin Jiang, Yuyang Liang, Qihan Ren, Yi Xin, Rui Yang, Fenze Feng, Mingcai Chen, Hengyang Lu, Haozhe Wang, Xiaoye Qu, Dongrui Liu, Lizhen Cui, Yuntao Du

TL;DR

This work introduces MMKC-Bench, a multimodal knowledge conflict benchmark for evaluating how Large Multimodal Models handle factual knowledge conflicts in both context-memory and inter-context settings. It comprises 1,573 knowledge instances and 3,381 images across 23 categories, built by pairing original knowledge with counterfactual conflict knowledge generated via LLMs, and it evaluates model behavior and conflict-detection capabilities using MCQ and open-ended VQA formats. The study finds that current LMMs predominantly rely on internal parametric knowledge, that they are more sensitive to knowledge-level conflicts than to recognition-based ones, and that the tendency to favor internal knowledge grows stronger as model size increases. Models can nevertheless detect conflicts with reasonable accuracy in both coarse- and fine-grained settings. MMKC-Bench provides a framework for analyzing multimodal knowledge conflicts and for informing the design of more reliable multimodal RAG systems. The authors acknowledge a distributional gap introduced by synthetic counterfactual generation and call for extensions to real-world benchmarks.

Abstract

Large Multimodal Models (LMMs) face notable challenges when encountering multimodal knowledge conflicts, particularly under retrieval-augmented generation (RAG) frameworks, where contextual information from external sources may contradict the model's internal parametric knowledge and lead to unreliable outputs. However, existing benchmarks fail to reflect such realistic conflict scenarios. Most focus solely on intra-memory conflicts, while context-memory and inter-context conflicts remain largely under-investigated. Furthermore, commonly used factual knowledge-based evaluations are often overlooked, and existing datasets lack a thorough investigation into conflict detection capabilities. To bridge this gap, we propose MMKC-Bench, a benchmark designed to evaluate factual knowledge conflicts in both context-memory and inter-context scenarios. MMKC-Bench encompasses three types of multimodal knowledge conflicts and includes 1,573 knowledge instances and 3,381 images across 23 broad types, collected through automated pipelines with human verification. We evaluate three representative series of LMMs on both model behavior analysis and conflict detection tasks. Our findings show that while current LMMs are capable of recognizing knowledge conflicts, they tend to favor internal parametric knowledge over external evidence. We hope MMKC-Bench will foster further research in multimodal knowledge conflict and enhance the development of multimodal RAG systems. The source code is available at https://github.com/MLLMKCBENCH/MLLMKC.
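
To make the behavior analysis concrete, the following is a minimal sketch of how answers to multi-choice questions under a context-memory conflict could be tallied. It is not the benchmark's released code: the function name `answer_ratios`, the record keys (`prediction`, `original`, `conflict`), and the bucket names (`internal`, `external`, `other`) are all assumptions chosen for illustration.

```python
from collections import Counter


def answer_ratios(records):
    """Tally how often a model sides with internal vs. external knowledge.

    Hypothetical record layout (an assumption, not the benchmark's schema):
      prediction: option label chosen by the model, e.g. "A"
      original:   option label matching the original (parametric) knowledge
      conflict:   option label matching the conflicting external context
    """
    counts = Counter()
    for record in records:
        if record["prediction"] == record["original"]:
            counts["internal"] += 1   # model kept its parametric answer
        elif record["prediction"] == record["conflict"]:
            counts["external"] += 1   # model followed the provided context
        else:
            counts["other"] += 1      # neither option was chosen

    total = sum(counts.values())
    buckets = ("internal", "external", "other")
    if total == 0:
        return {name: 0.0 for name in buckets}
    return {name: counts[name] / total for name in buckets}


if __name__ == "__main__":
    demo = [
        {"prediction": "A", "original": "A", "conflict": "B"},
        {"prediction": "A", "original": "A", "conflict": "C"},
        {"prediction": "B", "original": "A", "conflict": "B"},
        {"prediction": "D", "original": "A", "conflict": "B"},
    ]
    print(answer_ratios(demo))
```

Under this sketch, a high `internal` ratio would correspond to the reported tendency of LMMs to favor parametric knowledge over external evidence.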

Paper Structure

This paper contains 29 sections, 23 figures, 6 tables.

Figures (23)

  • Figure 1: Three types of multimodal knowledge conflict in MMKC-Bench. It is noted that the original knowledge is shown to help understand what the conflict is, and is not contained in the dataset.
  • Figure 2: The construction pipeline of MMKC-Bench.
  • Figure 3: The data types of MMKC-Bench.
  • Figure 4: The results of Qwen2.5-VL with different model sizes under context-memory conflict with multi-choice question format.
  • Figure 5: The results of Qwen2.5-VL with different model sizes under inter-context conflict with multi-choice question format.
  • ...and 18 more figures