Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models

Zhen Zeng, Leijiang Gu, Xun Yang, Zhangling Duan, Zenglin Shi, Meng Wang

TL;DR

The paper proposes a visual-oriented, fine-grained multimodal knowledge editing task that targets precise editing in images with multiple interacting entities, together with a Multimodal Scope Classifier-based Knowledge Editor (MSCKE) framework. Experiments demonstrate that MSCKE is effective at handling the complex challenges of multimodal knowledge editing.

Abstract

Knowledge editing aims to efficiently and cost-effectively correct inaccuracies and update outdated information. Recently, there has been growing interest in extending knowledge editing from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs), which integrate both textual and visual information, introducing additional editing complexities. Existing multimodal knowledge editing works primarily focus on text-oriented, coarse-grained scenarios, failing to address the unique challenges posed by multimodal contexts. In this paper, we propose a visual-oriented, fine-grained multimodal knowledge editing task that targets precise editing in images with multiple interacting entities. We introduce the Fine-Grained Visual Knowledge Editing (FGVEdit) benchmark to evaluate this task. Moreover, we propose a Multimodal Scope Classifier-based Knowledge Editor (MSCKE) framework. MSCKE leverages a multimodal scope classifier that integrates both visual and textual information to accurately identify and update knowledge related to specific entities within images. This approach ensures precise editing while preserving irrelevant information, overcoming the limitations of traditional text-only editing methods. Extensive experiments on the FGVEdit benchmark demonstrate that MSCKE outperforms existing methods, showcasing its effectiveness in solving the complex challenges of multimodal knowledge editing.

Paper Structure

This paper contains 20 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Comparison of fine-grained and coarse-grained knowledge editing in multimodal models. Coarse-grained knowledge editing treats the entire image as a single entity, allowing challenges to be addressed through simple text replacements. In contrast, fine-grained knowledge editing presents challenges that require editing methods to target specific entities within the image.
  • Figure 2: Architecture of the MSCKE method, illustrating the multimodal scope classifier, the editing memory, the base model with frozen parameters, and the counterfactual model with trainable parameters. During the editing phase, editing samples are stored in the editing memory. In the inference phase, the multimodal scope classifier computes the similarity between the input and the editing samples in memory. Inputs with a similarity score below 0.5 are classified as out-of-scope and are processed by the base model; all other inputs are treated as in-scope and are handled by the counterfactual model.
  • Figure 3: The construction pipeline of the specificity dataset. For all questions related to a given image, the first question is selected as the classification criterion, and the subsequent questions are assessed against it using ChatGPT. Classification occurs in two stages: the first stage categorizes the data as in-scope or out-of-scope based on the image, while the second stage applies a text-based classification to filter out difficult data. Difficult data refers to instances whose category cannot be determined through textual analysis alone.
  • Figure 4: Comparison of classification performance between MSCKE's multimodal scope classifier and SERAC's classifier on text and image samples.
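
The inference-time routing described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `EditSample` type, the `route` function, and the use of cosine similarity over a hypothetical joint image-and-text embedding are all placeholders for the learned multimodal scope classifier, whose actual scoring function is not given here. Only the 0.5 threshold and the base/counterfactual split come from the caption.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class EditSample:
    """One stored edit. The embedding is a hypothetical joint image+text vector."""
    embedding: Sequence[float]
    label: str


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in similarity; the paper uses a learned scope classifier instead."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(
    query_embedding: Sequence[float],
    memory: List[EditSample],
    threshold: float = 0.5,
) -> Tuple[str, Optional[EditSample]]:
    """Return which model should answer, plus the matched edit if in scope."""
    if not memory:
        # No edits stored yet, so every input is out of scope.
        return "base", None
    scored = [(cosine_similarity(query_embedding, s.embedding), s) for s in memory]
    best_score, best_sample = max(scored, key=lambda pair: pair[0])
    if best_score < threshold:
        # Below the threshold: out of scope, answered by the frozen base model.
        return "base", None
    # Otherwise in scope: answered by the counterfactual model with the edit.
    return "counterfactual", best_sample


# Toy editing memory with two stored edits (illustrative vectors only).
memory = [
    EditSample(embedding=[1.0, 0.0], label="edit-A"),
    EditSample(embedding=[0.0, 1.0], label="edit-B"),
]
in_scope = route([0.9, 0.1], memory)
out_of_scope = route([-1.0, -1.0], memory)
```

In this toy setup the first query lies close to `edit-A` and is routed to the counterfactual model, while the second resembles neither stored edit and falls back to the base model. In the fine-grained setting the embedding would need to reflect the specific entity being asked about rather than the whole image, which is the motivation the paper gives for classifying scope from both modalities.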