Can We Edit Multimodal Large Language Models?

Siyuan Cheng, Bozhong Tian, Qingbin Liu, Xi Chen, Yongheng Wang, Huajun Chen, Ningyu Zhang

TL;DR

Editing multimodal LLMs is significantly more complex than single-modal editing due to cross-modal interactions. The authors introduce MMEdit, a benchmark for evaluating reliability, locality, and generality of edits on multimodal models, and assess a range of baselines (finetuning, MEND, KE, SERAC, IKE) using BLIP-2 OPT and MiniGPT-4. Key findings show strong reliability gains for several methods but notable challenges in stabilizing the vision module and achieving robust generalization, highlighting the need for joint, cross-modal editing strategies. The work provides public code and datasets, offering a foundation for advancing multimodal model editing research and practical deployment considerations.

Abstract

In this paper, we focus on editing Multimodal Large Language Models (MLLMs). Compared to editing single-modal LLMs, multimodal model editing is more challenging and demands a higher level of scrutiny and careful consideration in the editing process. To facilitate research in this area, we construct a new benchmark, dubbed MMEdit, for editing multimodal LLMs and establish a suite of innovative metrics for evaluation. We conduct comprehensive experiments involving various model editing baselines and analyze the impact of editing different components of multimodal LLMs. Empirically, we find that previous baselines can edit multimodal LLMs to some extent, but the effect is still barely satisfactory, indicating the potential difficulty of this task. We hope that our work can provide the NLP community with insights. Code and dataset are available at https://github.com/zjunlp/EasyEdit.

Paper Structure (39 sections, 5 equations, 7 figures, 7 tables)

This paper contains 39 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Overview of the multimodal model editing task. The editing target is to update the model's understanding of the edited input (e.g., image or text), while ensuring its interpretation of unrelated inputs remains as consistent as possible.
  • Figure 2: Using a multimodal LLM (e.g., BLIP-2 OPT) as an example, we dissect the model into two components (vision module and textual module). The model's erroneous output could stem from either or both of these modules. Drawing an analogy with human errors in "vision" and "speech", we apply model editing methods to these two components to correct the model's output.
  • Figure 3: Taking the text modality as an example, the edit target and its generalizations are in-scope: they ask about the number of skyscrapers in a given image. Out-of-scope inputs, such as a question about a publication date, are unrelated to the edit. In-scope inputs require editing, whereas outputs for out-of-scope inputs should remain unchanged.
  • Figure 4: Generality dataset construction process.
  • Figure 5: Generality of different editing methods.
  • ...and 2 more figures
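
The in-scope versus out-of-scope distinction in Figures 1 and 3 can be illustrated with a small scoring sketch. This is not the MMEdit implementation or the EasyEdit API: the function names, the exact-match scoring, and the toy lookup-table "models" are all assumptions made for illustration, and the paper's own metric definitions may differ. Reliability is scored on the edit target, generality on a rephrased in-scope query, and locality as agreement between the pre-edit and post-edit models on an out-of-scope query.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interfaces (not from the paper or EasyEdit):
# a model maps (image_id, prompt) to an answer string.
Model = Callable[[str, str], str]
Example = Tuple[str, str, str]   # (image_id, prompt, target_answer)
Query = Tuple[str, str]          # (image_id, prompt)


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    """Fraction of examples whose output exactly matches the target."""
    if not examples:
        return 0.0
    hits = sum(model(img, prompt) == target for img, prompt, target in examples)
    return hits / len(examples)


def agreement(before: Model, after: Model, queries: Sequence[Query]) -> float:
    """Fraction of queries on which the pre- and post-edit models agree."""
    if not queries:
        return 0.0
    same = sum(before(img, prompt) == after(img, prompt) for img, prompt in queries)
    return same / len(queries)


def evaluate_edit(
    before: Model,
    after: Model,
    edit_set: Sequence[Example],
    rephrased_set: Sequence[Example],
    out_of_scope: Sequence[Query],
) -> Dict[str, float]:
    """Score one edit on the three axes named in the TL;DR."""
    return {
        "reliability": accuracy(after, edit_set),       # edit target itself
        "generality": accuracy(after, rephrased_set),   # in-scope rephrasings
        "locality": agreement(before, after, out_of_scope),  # unrelated inputs
    }


def table_model(table: Dict[Query, str]) -> Model:
    """Toy stand-in for a multimodal LLM: a lookup table of answers."""
    return lambda img, prompt: table.get((img, prompt), "")


# Toy data loosely modeled on the Figure 3 example (values are invented).
base_answers = {
    ("img1", "How many skyscrapers are in the image?"): "two",
    ("img1", "Count the skyscrapers in the picture."): "two",
    ("img2", "When was this published?"): "1999",
}
# The "edit" changes only the exact edit target, not its rephrasing.
edited_answers = dict(base_answers)
edited_answers[("img1", "How many skyscrapers are in the image?")] = "three"

scores = evaluate_edit(
    before=table_model(base_answers),
    after=table_model(edited_answers),
    edit_set=[("img1", "How many skyscrapers are in the image?", "three")],
    rephrased_set=[("img1", "Count the skyscrapers in the picture.", "three")],
    out_of_scope=[("img2", "When was this published?")],
)
print(scores)
```

In this toy case the edit is reliable and local but does not generalize, which mirrors the kind of gap the TL;DR describes; a real evaluation would replace the lookup tables with calls to the pre-edit and post-edit models.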