Retention analysis of edited knowledge after fine-tuning

Fufang Wen, Shichang Zhang

TL;DR

The paper examines whether knowledge edits introduced by model editing methods survive subsequent fine-tuning of LLMs. It systematically evaluates multiple knowledge editing (KE) methods across four downstream tasks and shows that edited knowledge is typically more prone to forgetting than intrinsic knowledge, consistent with an elasticity-theory view that ties retention to pretraining data volume. The authors propose two practical remedies, paraphrase-augmented edits and selective layer freezing during fine-tuning, which restore or even exceed intrinsic retention in many settings. They also give a formal equation linking retention dynamics to data volumes: $\frac{d\gamma_{p_{\theta}}^{\mathcal{D}_{2}/\mathcal{D}}}{d l} = \Theta\left(-k\frac{d\gamma_{p_{\theta}}^{\mathcal{D}_{1}/\mathcal{D}}}{d l}\right)$ with $l = \frac{|\mathcal{D}_3|}{|\mathcal{D}_2|} \ll 1$ and $k = \frac{|\mathcal{D}_1|}{|\mathcal{D}_2|} \gg 1$. The findings show that paraphrase-rich edits and targeted freezing yield more robust retention, which can guide the practical deployment of KE methods in downstream pipelines.
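
As a rough numeric illustration of the elasticity relation: the TL;DR does not say what $\gamma$ or $\mathcal{D}_1$, $\mathcal{D}_2$, $\mathcal{D}_3$ denote, so this sketch treats them only as dataset sizes and rates of change. Read literally, the relation says the rate of change on $\mathcal{D}_2$ has the opposite sign to the rate on $\mathcal{D}_1$ and is on the order of $k$ times larger in magnitude. The dataset sizes, the function names (`data_ratios`, `implied_rate_d2`), and the single unit constant standing in for $\Theta(\cdot)$ are hypothetical choices for this sketch, not values from the paper.

```python
def data_ratios(n_d1: int, n_d2: int, n_d3: int) -> tuple[float, float]:
    """Return (l, k) with l = |D3| / |D2| and k = |D1| / |D2|."""
    return n_d3 / n_d2, n_d1 / n_d2


def implied_rate_d2(d_gamma1_dl: float, k: float, const: float = 1.0) -> float:
    """Order-of-magnitude estimate of d(gamma^{D2/D})/dl from d(gamma^{D1/D})/dl.

    `const` stands in for the unknown constant hidden by Theta(.);
    using a single value of 1.0 is an assumption made only for illustration.
    """
    return -const * k * d_gamma1_dl


# Hypothetical sizes chosen to satisfy the stated regime: l << 1 and k >> 1.
l, k = data_ratios(n_d1=1_000_000, n_d2=1_000, n_d3=10)
assert l < 1 < k
print(f"l = {l}, k = {k}")

# A small change on D1 maps to a change about k times larger, with opposite sign, on D2.
print(implied_rate_d2(d_gamma1_dl=-1e-4, k=k))
```

The point of the sketch is only the scaling: because $k \gg 1$, a barely noticeable shift on the large dataset corresponds to a much larger shift on the small one.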

Abstract

Large language models (LLMs) store vast amounts of knowledge, which often requires updates to correct factual errors, incorporate newly acquired information, or adapt model behavior. Model editing methods have emerged as efficient solutions for such updates, offering localized and precise knowledge modification at significantly lower computational cost than continual training. In parallel, LLMs are frequently fine-tuned for a wide range of downstream tasks. However, the effect of fine-tuning on previously edited knowledge remains poorly understood. In this work, we systematically investigate how different fine-tuning objectives interact with various model editing techniques. Our findings show that edited knowledge is substantially more susceptible to forgetting during fine-tuning than intrinsic knowledge acquired through pre-training. This analysis highlights a key limitation of current editing approaches and suggests that evaluating edit robustness under downstream fine-tuning is critical for their practical deployment. We further find that knowledge retention can be significantly improved either by augmenting edited knowledge with paraphrases or by freezing the layers associated with edited content during fine-tuning, offering insight for developing more robust editing algorithms.
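
The two remedies can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation. `EditRequest`, `augment_with_paraphrases`, and `params_to_freeze` are hypothetical names, and the paraphrases are invented for illustration. The parameter-naming pattern `<prefix>.<layer_key>.<idx>.<rest>` is also an assumption: LLaMA-style checkpoints in Hugging Face Transformers use names like `model.layers.6.mlp.down_proj.weight`, and GPT-2 uses `transformer.h.6...`, but other architectures differ.

```python
from dataclasses import dataclass, field


@dataclass
class EditRequest:
    """Hypothetical edit record; field names are illustrative, not the paper's schema."""
    prompt: str          # e.g. "Windows Mobile 6.5 was developed by"
    target_new: str      # e.g. "Apple"
    paraphrases: list[str] = field(default_factory=list)


def augment_with_paraphrases(req: EditRequest, n_paraphrases: int) -> list[tuple[str, str]]:
    """Expand one edit into (prompt, target) pairs: the original prompt
    plus up to n_paraphrases rephrasings that share the same target."""
    prompts = [req.prompt] + req.paraphrases[:n_paraphrases]
    return [(p, req.target_new) for p in prompts]


def params_to_freeze(param_names: list[str], edited_layers: set[int],
                     layer_key: str = "layers") -> list[str]:
    """Return the parameter names that belong to one of the edited layers.
    Assumes names of the form '<prefix>.<layer_key>.<idx>.<rest>'."""
    frozen = []
    for name in param_names:
        parts = name.split(".")
        for i in range(len(parts) - 1):
            idx = parts[i + 1]
            if parts[i] == layer_key and idx.isdigit() and int(idx) in edited_layers:
                frozen.append(name)
                break
    return frozen


req = EditRequest(
    prompt="Windows Mobile 6.5 was developed by",
    target_new="Apple",
    paraphrases=["The developer of Windows Mobile 6.5 is",  # invented paraphrases
                 "Windows Mobile 6.5 is a product of"],
)
print(augment_with_paraphrases(req, n_paraphrases=2))

names = ["model.layers.5.mlp.down_proj.weight",
         "model.layers.6.mlp.down_proj.weight",
         "model.layers.16.self_attn.q_proj.weight",
         "lm_head.weight"]
print(params_to_freeze(names, edited_layers={6}))
# ['model.layers.6.mlp.down_proj.weight']
```

In a PyTorch fine-tuning loop, one would then call `p.requires_grad_(False)` for each `(name, p)` in `model.named_parameters()` whose name appears in the returned list. Which layers count as "associated with edited content" depends on the editing method, so the sketch leaves that choice to the caller.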

Paper Structure

This paper contains 20 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Demonstration of model editing and downstream fine-tuning and their impact on the knowledge in LLMs. The original model is edited with a single new fact, "Windows Mobile 6.5 was developed by Apple", and the edited model is then fine-tuned on an unrelated dataset that contains none of the subject, relation, or object of the edited fact. Although the edit succeeds, it is vulnerable to a range of downstream fine-tuning tasks. $f_{\theta}$, $f_{\theta'}$, and $f_{\theta''}$ denote the pre-trained model, the edited model, and the fine-tuned model, respectively.
  • Figure 2: Retention rates of edited and intrinsic knowledge after model editing and fine-tuning, for different combinations of upstream editing methods and downstream fine-tuning methods. For ROME, layer 6 is used as the edit layer; for FT, layer 1 is used as the edit layer.
  • Figure 3: Different types of output tokens, using the prompt "Windows Mobile 6.5 was developed by" as an example.
  • Figure 4: (a) Distribution of the first generated token vs. training epoch for edited knowledge. (b) Distribution of the first generated token vs. training epoch for intrinsic knowledge. Initially, the model correctly predicts the target token "Microsoft". After editing, the model assigns an extremely high probability (0.992) to the edited target token "Apple", while the true target "Microsoft" receives a very low probability (below 0.001). However, after just one and a half epochs of fine-tuning, the edited token "Apple" disappears from the top-3 predicted tokens. The top-3 token rankings stabilize after one epoch of fine-tuning.
  • Figure 5: Edit success rate and retention rate vs. the number of paraphrases per knowledge fact for (a) MEMIT and (b) AlphaEdit.
  • ...and 2 more figures
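
The figures above report retention rates and top-3 first-token rankings. This summary does not give the exact metric definitions, so the sketch below assumes a fact counts as retained when the first predicted token after fine-tuning still equals its target, measured only over facts predicted correctly before fine-tuning. For edited knowledge the target is the new object (e.g. "Apple"); for intrinsic knowledge it is the original object (e.g. "Microsoft"). The function names, predictions, and logits are hypothetical.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Numerically stable softmax over a {token: logit} mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_tokens(logits: dict[str, float], k: int = 3) -> list[str]:
    """Return the k most probable tokens, highest probability first."""
    probs = softmax(logits)
    return sorted(probs, key=probs.get, reverse=True)[:k]


def retention_rate(before: list[str], after: list[str], targets: list[str]) -> float:
    """Assumed definition: among facts whose first predicted token matched the
    target before fine-tuning, the fraction that still match after fine-tuning."""
    kept = total = 0
    for pred_before, pred_after, target in zip(before, after, targets):
        if pred_before == target:  # only count facts that were correct before fine-tuning
            total += 1
            kept += int(pred_after == target)
    return kept / total if total else 0.0


# Hypothetical first-token predictions for three edited facts (not the paper's data).
targets = ["Apple", "Paris", "Google"]
before = ["Apple", "Paris", "Microsoft"]   # third edit failed, so it is skipped
after = ["Microsoft", "Paris", "Microsoft"]
print(retention_rate(before, after, targets))  # 0.5

# Hypothetical logits for the prompt "Windows Mobile 6.5 was developed by".
print(top_k_tokens({"Apple": 5.0, "Microsoft": 1.0, "Google": 0.5, "IBM": 0.1}))
# ['Apple', 'Microsoft', 'Google']
```

Under this assumed definition, edited and intrinsic retention differ only in the `targets` list passed to `retention_rate`.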