MC-MKE: A Fine-Grained Multimodal Knowledge Editing Benchmark Emphasizing Modality Consistency

Junzhe Zhang, Huixuan Zhang, Xunjian Yin, Baizhou Huang, Xu Zhang, Xinyu Hu, Xiaojun Wan

TL;DR

This work presents MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency, and evaluates four multimodal knowledge editing methods on MC-MKE, revealing their limitations, particularly in terms of modality consistency.

Abstract

Multimodal large language models (MLLMs) are prone to non-factual or outdated knowledge issues, which can manifest as misreading and misrecognition errors due to the complexity of multimodal knowledge. Previous benchmarks have not systematically analyzed the performance of editing methods in correcting these two error types. To better represent and correct these errors, we decompose multimodal knowledge into its visual and textual components. Different error types correspond to different editing formats, which edit distinct parts of the multimodal knowledge. We present MC-MKE, a fine-grained Multimodal Knowledge Editing benchmark emphasizing Modality Consistency. Our benchmark facilitates independent correction of misreading and misrecognition errors by editing the corresponding knowledge component. We evaluate four multimodal knowledge editing methods on MC-MKE, revealing their limitations, particularly in terms of modality consistency. Our work highlights the challenges posed by multimodal knowledge editing and motivates further research in developing effective techniques for this task.

Paper Structure

This paper contains 32 sections, 10 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: An illustration of multimodal knowledge and the two types of multimodal errors: misrecognizing a picture of Mac Allister as Messi, and misreading Messi's football team.
  • Figure 2: The upper part shows the editing of different components of MLLMs. The lower part gives an overview of the editing formats. Given an input image and its corresponding textual knowledge $(s, r, o)$, three editing formats are shown. Although the final output is the same, the edited multimodal knowledge differs depending on whether the visual or the textual knowledge is edited, and the consistency property also differs across edit inputs.
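
The decomposition described in the abstract and in the figure captions can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the benchmark's implementation or any evaluated editing method: the class `ToyKnowledgeStore`, its methods, the image identifier `img_001`, the relation name `team`, and the values `team_A`, `team_B`, `team_C` are all hypothetical placeholders. The visual component maps an image to an entity, the textual component maps $(s, r)$ to $o$, and an answer is obtained by chaining the two. Only the two error types named in the abstract are modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ToyKnowledgeStore:
    """Hypothetical toy model: multimodal knowledge split into two components."""
    # Visual component: image id -> recognized entity (the subject s).
    visual: Dict[str, str] = field(default_factory=dict)
    # Textual component: (s, r) -> o.
    textual: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def answer(self, image_id: str, relation: str) -> str:
        """Answer a query about an image by chaining both components."""
        subject = self.visual[image_id]
        return self.textual[(subject, relation)]

    def edit_visual(self, image_id: str, entity: str) -> None:
        """Correct a misrecognition error: change which entity the image maps to."""
        self.visual[image_id] = entity

    def edit_textual(self, subject: str, relation: str, obj: str) -> None:
        """Correct a misreading error: change the object of a textual triple."""
        self.textual[(subject, relation)] = obj


# Placeholder data: the image actually shows Mac Allister but is
# misrecognized as Messi (the scenario in Figure 1). Team values are dummies.
store = ToyKnowledgeStore(
    visual={"img_001": "Messi"},
    textual={("Messi", "team"): "team_A", ("Mac Allister", "team"): "team_B"},
)

before = store.answer("img_001", "team")

# Editing only the visual component fixes the misrecognition ...
store.edit_visual("img_001", "Mac Allister")
after_visual_edit = store.answer("img_001", "team")
# ... and must leave the textual knowledge about Messi untouched.
messi_team_after_visual_edit = store.textual[("Messi", "team")]

# Editing only the textual component fixes a misreading about Messi ...
store.edit_textual("Messi", "team", "team_C")
# ... and must not change the answer for the image of Mac Allister.
after_textual_edit = store.answer("img_001", "team")
```

In this toy setting, modality consistency amounts to the two checks in the comments: an edit to one component changes the answers that depend on it and leaves the other component as it was. Real MLLMs do not store the two components in separate tables, which is why this property has to be measured rather than assumed.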