
Can Fine-Tuning Erase Your Edits? On the Fragile Coexistence of Knowledge Editing and Adaptation

Yinjie Cheng, Paul Youssef, Christin Seifert, Jörg Schlötterer, Zhixue Zhao

TL;DR

This work investigates whether explicit knowledge edits in large language models survive when the edited model is subsequently fine-tuned. By systematically evaluating MEMIT, AlphaEdit, and MEND edits under full fine-tuning (FT), LoRA, and DoRA on multiple models and knowledge datasets, the authors show that fine-tuning generally causes edits to decay, with the degree of decay varying by model, editing method, and number of edits. They also show that fine-tuning only the edited layers erases edits more effectively (with downstream trade-offs), whereas fine-tuning only the non-edited layers does little to preserve edits and often harms downstream tasks; this highlights the tight coupling between editing and adaptation. Through qualitative analyses and layer-wise activation studies, the paper explains why edits are fragile: fine-tuning shifts distributed activation directions orthogonally to the editing directions, overwriting previously inserted knowledge. The study further shows that malicious edits can persist and transfer through the pipeline, underscoring safety risks and the need for knowledge editing (KE) approaches that are robust to downstream optimization. Overall, the authors provide empirical baselines, propose selective-layer tuning strategies, and advocate evaluating model editing within the entire LLM deployment pipeline to balance adaptability with steerability.

Abstract

Knowledge editing (KE) has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits (Fig. 1), current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edit decay after fine-tuning, investigating how fine-tuning affects knowledge editing. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we find that fine-tuning only the edited layers can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning only the non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
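
As a rough illustration of what quantifying edit decay could look like, the sketch below compares the edit success rate of an edited model before and after fine-tuning. The function names, the case-insensitive exact-match criterion, and the relative-drop formula are all assumptions made for illustration; they are not taken from the paper, which may define its metrics differently.

```python
def edit_success_rate(predictions, targets):
    """Fraction of edited prompts whose prediction matches the edit target.

    Assumption: a case-insensitive exact match counts as a surviving edit.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("need at least one edit to evaluate")
    hits = sum(
        p.strip().lower() == t.strip().lower()
        for p, t in zip(predictions, targets)
    )
    return hits / len(targets)


def edit_decay(rate_before_ft, rate_after_ft):
    """Relative drop in edit success after fine-tuning.

    Hypothetical definition: 0.0 means no edits were lost,
    1.0 means every previously successful edit was lost.
    """
    if rate_before_ft <= 0:
        raise ValueError("edit success before fine-tuning must be positive")
    return (rate_before_ft - rate_after_ft) / rate_before_ft


# Toy, made-up outputs for four edited prompts (not real results).
targets = ["Paris", "Berlin", "Rome", "Madrid"]
preds_edited = ["Paris", "Berlin", "Rome", "Lisbon"]       # edited model
preds_edited_ft = ["Paris", "London", "Rome", "Lisbon"]    # edited, then fine-tuned

before = edit_success_rate(preds_edited, targets)
after = edit_success_rate(preds_edited_ft, targets)
print(f"success before FT: {before:.2f}, after FT: {after:.2f}")
print(f"relative edit decay: {edit_decay(before, after):.2f}")
```

A relative measure is used here so that configurations with different initial edit success can be compared on the same scale; an absolute difference would be an equally reasonable choice.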

Paper Structure

This paper contains 68 sections, 11 equations, 9 figures, 36 tables.

Figures (9)

  • Figure 1: An illustration of an LLM ($M$) that undergoes an edit ($M_\text{ed}$) and then fine-tuning ($M_\text{ed\_ft}$). This process results in the loss of edited knowledge and the production of incorrect outputs. This is an illustrative example; real cases are shown in Sec. \ref{sec:qualitative_analysis}.
  • Figure 2: Editing performance of Llama2 on the zsRE dataset before and after fine-tuning. Editing performance after fine-tuning (LoRA, DoRA, and full fine-tuning) is compared against the editing performance before fine-tuning.
  • Figure 3: Layer-wise activation drift (top three panels) for GPT2-XL, GPT-J, and Llama2, with three model variants each: $M_\text{ed}$, $M_\text{ft}$, and $M_\text{ed\_ft}$. Directional similarity (bottom three panels) for three pairs of variants per model: $M_\text{ed}$ - $M_\text{ft}$, $M_\text{ed\_ft}$ - $M_\text{ft}$, and $M_\text{ed\_ft}$ - $M_\text{ed}$. The red vertical dashed lines mark the range of edited layers. Detailed results are given in App. \ref{sec:activation_res_breakdown}.
  • Figure 4: Difference between validating values and original values across eight downstream tasks.
  • Figure 5: Difference between validating values and original values, expressed as a ratio, across eight downstream tasks.
  • ...and 4 more figures
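
The caption of Figure 3 refers to layer-wise activation drift and directional similarity. The sketch below shows one common way such quantities can be computed for a single layer: drift as the L2 norm of the change in a hidden-state vector relative to the base model, and directional similarity as the cosine similarity between two drift vectors. These definitions, the helper names, and the toy vectors are assumptions for illustration only; the paper's exact formulation is in its appendix and is not reproduced here.

```python
import math


def drift_vector(base, variant):
    """Element-wise change of a variant's activation relative to the base model."""
    if len(base) != len(variant):
        raise ValueError("activation vectors must have the same dimension")
    return [v - b for b, v in zip(base, variant)]


def l2_norm(vec):
    """Euclidean length of a vector; used here as the drift magnitude."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(u, v):
    """Cosine of the angle between two drift vectors (1 = same direction, 0 = orthogonal)."""
    norm_u, norm_v = l2_norm(u), l2_norm(v)
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Toy 2-dimensional activations for one layer (made-up numbers).
h_base = [1.0, 0.0]   # original model M
h_ed = [1.0, 2.0]     # edited model M_ed
h_ft = [4.0, 0.0]     # fine-tuned model M_ft

d_ed = drift_vector(h_base, h_ed)
d_ft = drift_vector(h_base, h_ft)

print("edit drift magnitude:", l2_norm(d_ed))
print("fine-tuning drift magnitude:", l2_norm(d_ft))
print("directional similarity:", cosine_similarity(d_ed, d_ft))
```

In this toy case the two drift vectors are orthogonal, so their similarity is zero. In practice the activations would be collected per layer from real forward passes, for example averaged over a set of prompts, which this sketch does not attempt.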