LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models

Renzhi Wang; Piji Li

TL;DR

LEMoE tackles lifelong model editing for large language models by identifying catastrophic forgetting, routing inconsistency, and editing-order sensitivity as core challenges. It proposes three mechanisms—module inserting to freeze prior edits, KV anchor routing to align training and inference routes, and clustering-based order planning to optimize edit sequences—implemented within a MoE adaptor. Across LLaMA2-7B and Mistral-7B on ZsRE and SelfCheckGPT, LEMoE outperforms prior methods in lifelong editing while maintaining strong batch-editing performance, demonstrating improved reliability, generality, and locality. The approach offers a scalable, parameter-efficient path for continual knowledge updates in autoregressive LLMs, with potential for extension to larger models and varied architectures.

Abstract

Large language models (LLMs) require continual knowledge updates to stay abreast of ever-changing world facts, prompting the formulation of the lifelong model editing task. While recent years have witnessed the development of various techniques for single and batch editing, these methods either fail to apply or perform sub-optimally when faced with lifelong editing. In this paper, we introduce LEMoE, an advanced Mixture of Experts (MoE) adaptor for lifelong model editing. We first analyze the factors influencing the effectiveness of conventional MoE adaptors in lifelong editing, including catastrophic forgetting, inconsistent routing, and order sensitivity. Based on these insights, we propose a tailored module insertion method to achieve lifelong editing, incorporating a novel KV anchor routing to enhance routing consistency between the training and inference stages, along with a concise yet effective clustering-based editing order planning. Experimental results demonstrate the effectiveness of our method in lifelong editing, surpassing previous model editing techniques while maintaining outstanding performance on the batch editing task. Our code will be available.
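
To make the routing idea concrete, here is a minimal sketch of key-based expert routing in which one and the same function serves training and inference, as the Figure 4 caption describes. The scoring rule (a softmax over dot products between the instance-level embedding and each expert key) is an assumption chosen for illustration, and the names `routing_weights` and `select_expert` are hypothetical; the paper's exact formulation may differ.

```python
import math

def routing_weights(embedding, keys):
    """Return one weight per expert for a single input embedding.

    Assumption: the weight is a softmax over dot products between the
    instance-level embedding and each expert key vector.
    """
    scores = [sum(e * k for e, k in zip(embedding, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def select_expert(embedding, keys):
    """Pick the index of the expert with the largest routing weight."""
    weights = routing_weights(embedding, keys)
    return max(range(len(weights)), key=weights.__getitem__)

# Hypothetical toy setup: two experts with 2-dimensional key vectors.
keys = [[1.0, 0.0], [0.0, 1.0]]
embedding = [0.2, 0.9]
weights = routing_weights(embedding, keys)
chosen = select_expert(embedding, keys)
```

Because the embedding and keys fully determine the weights, calling the same function at training and inference time yields the same route for the same input, which is the consistency property the abstract targets.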

Paper Structure (54 sections, 13 equations, 5 figures, 8 tables)

Introduction
Preliminaries of Model Editing
Analysis of Influencing Factors
Catastrophic Forgetting Analysis
Experiments
Results
Routing Consistency Analysis
Experiments
Results
Order Sensitivity Analysis
Experiments
Results
Methods
New Module Inserting
KV Anchor Routing
...and 39 more sections

Figures (5)

Figure 1: The conceptual framework for LEMoE. We align the expert networks in MoE architecture with data batches and freeze the expert networks corresponding to previous data when conducting current edits. $\text{Data}_i$ and $\text{FFN}_i$ represent the current data and module, with dashed line parts indicating future edits.
Figure 2: Left: Reliability of conventional MoE at different evaluation stages. "Immediate evaluation" occurs immediately after each edit; "Final evaluation" occurs after all edits in lifelong editing. Right: Visualization of routing consistency. The value $C_{ij}$ in each block denotes the proportion of the input data processed by expert $i$ during the training phase that is routed to expert $j$ during the testing phase. Model: LLaMA2-7B. Dataset: ZsRE.
Figure 3: Left: Performance variability under different editing orders. Right: Within-Batch/Between-Batch Semantic Similarity vs. Reliability.
Figure 4: The overall architecture of LEMoE compared with a conventional MoE adaptor. We assume that LEMoE is currently at time $i$, editing $\text{data}_i$ using module $\text{FFN}_i$. Left: When editing $\text{data}_i$, the prior experts corresponding to previous data are all frozen, leaving only the new module $\text{FFN}_i$ and the router trainable. Right: In the training stage, depicted by the solid lines, the routing weight $g(i \mid x)$ (abbreviated as $g_i$) is computed using the instance-level embedding and expert key vectors $\{\boldsymbol{k}_1, \boldsymbol{k}_2, \dots, \boldsymbol{k}_i\}$ for expert selection. During inference, as indicated by the dashed lines, the same routing computation is employed to direct the input to the corresponding expert.
Figure 5: Reliability, Generality and Locality of conventional MoE at different evaluation stages. "Immediate evaluation" occurs immediately after each edit; "Final evaluation" occurs after all edits in lifelong editing. Model: LLaMA2-7B. Dataset: ZsRE.
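
The routing-consistency value $C_{ij}$ defined in the Figure 2 caption can be computed from two lists of expert assignments. The sketch below assumes each input has exactly one training-time expert and one test-time expert; the function name `routing_consistency` and the toy assignments are hypothetical and are not taken from the paper.

```python
def routing_consistency(train_experts, test_experts, num_experts):
    """Return C where C[i][j] is the fraction of inputs handled by expert i
    during training that are routed to expert j at test time."""
    if len(train_experts) != len(test_experts):
        raise ValueError("need one train and one test assignment per input")
    counts = [[0] * num_experts for _ in range(num_experts)]
    for i, j in zip(train_experts, test_experts):
        counts[i][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # An expert that saw no training inputs gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Hypothetical toy assignments for six inputs and three experts.
train = [0, 0, 1, 1, 2, 2]
test = [0, 1, 1, 1, 2, 0]
C = routing_consistency(train, test, num_experts=3)
```

A perfectly consistent router would give the identity matrix; mass off the diagonal indicates inputs that reach a different expert at test time than the one trained on them.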
