VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark

Han Huang; Haitian Zhong; Tao Yu; Qiang Liu; Shu Wu; Liang Wang; Tieniu Tan

TL;DR

VLKEB introduces a dedicated large vision-language model knowledge editing benchmark that leverages a multi-modal knowledge graph to ground edits in real images and entities. It extends the Portability metric and provides a comprehensive evaluation framework across five LVLMs with multiple editing methods, uncovering strengths and weaknesses in reliability, generality, locality, and cross-content transfer. The experiments reveal that in-context and memory-based approaches often excel in single-edit scenarios and portability, while parameter-update methods, including fine-tuning, struggle with long-horizon or multi-hop edits, highlighting the need for LVLM-specific editing strategies. The work offers practical insights and a valuable dataset to propel research on robust, transferable knowledge editing for multi-modal models, with clear directions for improving portability and handling sequential edits.

Abstract

Recently, knowledge editing on large language models (LLMs) has received considerable attention. In comparison, editing Large Vision-Language Models (LVLMs) faces extra challenges from diverse data modalities and complicated model components, and data for LVLM editing are limited. The existing LVLM editing benchmark, which comprises three metrics (Reliability, Locality, and Generality), falls short in the quality of its synthesized evaluation images and cannot assess whether models apply edited knowledge in relevant content. Therefore, we employ more reliable data collection methods to construct a new Large Vision-Language Model Knowledge Editing Benchmark, VLKEB, and extend the Portability metric for more comprehensive evaluation. Leveraging a multi-modal knowledge graph, our image data are bound to knowledge entities. This binding can be further used to extract entity-related knowledge, which constitutes the basis of the editing data. We conduct experiments with different editing methods on five LVLMs and thoroughly analyze how they impact the models. The results reveal strengths and deficiencies of these methods and hopefully provide insights for future research. The code and dataset are available at: https://github.com/VLKEB/VLKEB.

Paper Structure (60 sections, 7 equations, 7 figures, 12 tables)


Introduction
Related Works
LLM Editing Benchmarks
LLM Editing Methods
Large Vision-Language Models
Dataset Construction
Problem Formulation
Metrics: Reliability, Generality, Locality and Portability
Construction Process
Preparation
Image Selection
Reliability, Generality and Locality Evaluation Data Construction
Portability Evaluation Data Construction
Dataset Summary
Experiments
...and 45 more sections

Figures (7)

Figure 1: The image originally belongs to "Wichita Falls", and the editing target is to make the LVLM recognize it as "Fort Smith". The LVLM's answer measures edit Reliability. Generality inputs are "rephrased" images (i.e., images of the same entity that differ in perspective or appearance) and rephrased questions. Locality inputs are unrelated images and questions. Portability inputs are generated from sampled knowledge-graph triples containing the editing entity "Fort Smith".
Figure 2: Single editing applies one edit at a time and evaluates it immediately, while sequential editing applies edits continuously and tests each edit after several other edits.
Figure 3: Relative change (compared with unedited base model) of Multi-hop Portability results.
Figure 4: Average results in sequential editing. The horizontal axis is the test gap number from the sequential editing setup in Figure 2.
Figure 5: The class proportions of entities.
...and 2 more figures
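
The difference between the single and sequential editing setups described in the Figure 2 caption can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the "model" is a plain dictionary standing in for an LVLM, `apply_edit` is a hypothetical editor that overwrites one entry, and the score is a simple exact-match rate. It is not the benchmark's code or any specific editing method.

```python
from typing import Dict, List, Tuple

Edit = Tuple[str, str]   # (query key, target answer); hypothetical format
Model = Dict[str, str]   # toy stand-in for an LVLM: query -> answer


def apply_edit(model: Model, edit: Edit) -> Model:
    """Hypothetical editor: return a copy with one answer overwritten."""
    edited = dict(model)
    edited[edit[0]] = edit[1]
    return edited


def single_editing(base: Model, edits: List[Edit]) -> float:
    """Apply each edit to the unedited base model and test it immediately."""
    hits = 0
    for query, target in edits:
        edited = apply_edit(base, (query, target))
        hits += int(edited.get(query) == target)
    return hits / len(edits)


def sequential_editing(base: Model, edits: List[Edit], gap: int) -> float:
    """Apply edits one after another; test edit j after `gap` further edits."""
    model = dict(base)
    hits, tested = 0, 0
    for i, edit in enumerate(edits):
        model = apply_edit(model, edit)
        j = i - gap  # index of the edit that is now `gap` edits old
        if j >= 0:
            query, target = edits[j]
            hits += int(model.get(query) == target)
            tested += 1
    return hits / tested if tested else 0.0


# Toy data: the third edit touches the same query as the first one.
base_model = {"image_a": "Wichita Falls", "image_b": "unknown"}
edit_stream = [
    ("image_a", "Fort Smith"),
    ("image_b", "placeholder entity"),
    ("image_a", "another placeholder"),
]

single_score = single_editing(base_model, edit_stream)
gap1_score = sequential_editing(base_model, edit_stream, gap=1)
gap2_score = sequential_editing(base_model, edit_stream, gap=2)
```

In this toy setting every edit succeeds when tested immediately, but the first edit is no longer retained once a later edit overwrites the same entry, so the score at a gap of 2 drops. Real editing methods degrade for different reasons (for example, accumulated parameter changes), which this dictionary stand-in does not model.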
