Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts

Qizhou Chen, Chengyu Wang, Dakan Wang, Taolin Zhang, Wangyue Li, Xiaofeng He

TL;DR

LiveEdit is a lifelong editing framework for Vision LLMs (VLLMs) that uses a generative low-rank mixture-of-experts (MoE) to create an editing expert for each edit. A hard routing stage uses a trainable vision sentinel to filter out visually irrelevant experts, and a soft routing stage then fuses the remaining experts by textual semantic relevance to produce updated outputs, all while the base VLLM remains frozen. The method achieves strong lifelong editing performance across multiple backbones and datasets (E-VQA, E-IC, VLKEB), with hyperparameter settings such as $d_m=1024$, $r=4$, $l_e=21$. It shows substantial gains in reliability and generality and maintains near-100% locality even after many edits. The framework is validated through extensive ablations and instance analyses, which support its effectiveness and its potential for continuous real-world VLLM editing without full retraining.

Abstract

Model editing aims to correct inaccurate knowledge, update outdated information, and incorporate new data into Large Language Models (LLMs) without the need for retraining. This task poses challenges in lifelong scenarios, where edits must be applied continuously in real-world applications. While some editors demonstrate strong robustness for lifelong editing in pure LLMs, existing LLM editors cannot be directly applied to Vision LLMs (VLLMs), which incorporate an additional vision modality. In this paper, we propose LiveEdit, a LIfelong Vision language modEl Edit, to bridge the gap between lifelong LLM editing and VLLMs. We begin by training an editing expert generator to independently produce low-rank experts for each editing instance, with the goal of correcting the relevant responses of the VLLM. A hard filtering mechanism is developed to utilize visual semantic knowledge, thereby coarsely eliminating visually irrelevant experts for input queries during the inference stage of the post-edited model. Finally, to integrate visually relevant experts, we introduce a soft routing mechanism based on textual semantic relevance to achieve multi-expert fusion. For evaluation, we establish a benchmark for lifelong VLLM editing. Extensive experiments demonstrate that LiveEdit offers significant advantages in lifelong VLLM editing scenarios. Further experiments validate the rationality and effectiveness of each module design in LiveEdit.
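
The two-stage routing described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `cosine` and `apply_experts`, the dictionary layout of the repository, the fixed `threshold` (a stand-in for the trainable vision sentinel), and the softmax used for fusion are all assumptions made for illustration. The sketch only shows the general idea: drop experts whose visual feature does not match the query, weight the survivors by textual similarity, and add their low-rank updates to a hidden representation while the base model stays untouched.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def apply_experts(h, phi_q, psi_q, repository, threshold=0.8):
    """Adapt hidden vector h using a repository of low-rank experts.

    h          : (d,) hidden representation at the editing layer
    phi_q      : (k,) visual routing feature of the query
    psi_q      : (k,) textual routing feature of the query
    repository : list of dicts with keys "U" (d, r), "V" (r, d),
                 "phi" (k,), "psi" (k,); hypothetical layout
    threshold  : fixed cut-off standing in for the trainable vision sentinel
    """
    # Hard routing: discard experts that are visually unrelated to the query.
    kept = [e for e in repository if cosine(phi_q, e["phi"]) > threshold]
    if not kept:
        # No relevant edit: return the representation unchanged (locality).
        return h.copy()

    # Soft routing: softmax over textual similarities of the surviving experts
    # (assumed fusion rule, chosen for simplicity).
    scores = np.array([cosine(psi_q, e["psi"]) for e in kept])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Each expert contributes a rank-r update U @ (V @ h).
    delta = sum(w * (e["U"] @ (e["V"] @ h)) for w, e in zip(weights, kept))
    return h + delta
```

Calling `apply_experts` with a query whose visual feature matches no stored expert returns `h` unchanged, which mirrors the locality requirement. A query that matches several experts receives a weighted mixture of their low-rank updates.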

Paper Structure

This paper contains 28 sections, 16 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Lifelong VLLM Editing. In this scenario, the edited VLLM is required to respond correctly to queries involving the edited data within the generalization domain, while maintaining consistent responses in the locality domains. The top left shows test cases, where Rel., T-Gen./M-Gen., T-Loc., and M-Loc. denote reliability, text/modal generality, text locality, and modal locality, respectively. The bottom right illustrates the responses of an effectively edited VLLM across several editing timesteps.
  • Figure 2: Illustration of the LiveEdit framework. The upper part illustrates the editing process of LiveEdit. At time step $t$, the representation of an edit sample $(v_{e_t},p_{e_t},o_{e_t})$ at layer $l_e$ serves as an editing signal to generate the editing expert $(U_{e_t}, V_{e_t})$ via $f_{eg}$ and routing features $(\hat{\phi}_{v_{e_t}},\hat{\psi}_{p_{e_t}})$ via $\hat{f}_{fe}$. Both are then added to the expert repository $\mathcal{E}_{t}$. The lower part shows the VLLM inference process with LiveEdit, where $\bar{f}_{fe}$ extracts input sample features at layer $l_e$ to route editing experts, which then adapt the representation.
  • Figure 3: The impact of module dimension $d_m$ and expert rank $r$ on LiveEdit's edit performance. Experiments are conducted on the E-VQA dataset, with 1,000 edits on BLIP2. Circle size represents the number of LiveEdit's trainable parameters, and color intensity indicates the average edit performance across five metrics.
  • Figure 4: The dimension control parameter $k$ for feature extraction.
  • Figure 5: Impact of the layer index $l_e$ at which LiveEdit is attached. Results are reported for 1,000 edits of BLIP2 on the E-VQA dataset.
  • ...and 2 more figures