Can Fine-Tuning Erase Your Edits? On the Fragile Coexistence of Knowledge Editing and Adaptation
Yinjie Cheng, Paul Youssef, Christin Seifert, Jörg Schlötterer, Zhixue Zhao
TL;DR
This work investigates whether explicit knowledge edits in large language models survive subsequent fine-tuning. By systematically evaluating MEMIT, AlphaEdit, and MEND edits under full fine-tuning (FT), LoRA, and DoRA on multiple models and knowledge datasets, the authors show that fine-tuning generally decays edits, with the degree of decay varying by model, editing method, and the scale of edits. They also demonstrate that fine-tuning only the edited layers erases edits more effectively (with downstream trade-offs), whereas tuning only the non-edited layers offers limited preservation of edits and often harms downstream tasks; this highlights the tight coupling between editing and adaptation. Through qualitative analyses and layer-wise activation studies, the paper explains why edits are fragile: FT shifts distributed activation directions orthogonally to editing directions, overwriting previously inserted knowledge. The study further shows that malicious edits can persist and transfer through the pipeline, underscoring safety risks and the need for knowledge editing (KE) approaches that are robust to downstream optimization. Overall, the authors provide empirical baselines, propose selective-layer tuning strategies, and advocate evaluating model editing within the entire LLM deployment pipeline to balance adaptability with steerability.
Abstract
Knowledge editing (KE) has emerged as a lightweight alternative to retraining for correcting or injecting specific facts in large language models (LLMs). Meanwhile, fine-tuning remains the default operation for adapting LLMs to new domains and tasks. Despite their widespread adoption, these two post-training interventions have been studied in isolation, leaving open a crucial question: if we fine-tune an edited model, do the edits survive? This question is motivated by two practical scenarios: removing covert or malicious edits, and preserving beneficial edits. If fine-tuning impairs edits (Fig. 1), current KE methods become less useful, as every fine-tuned model would require re-editing, which significantly increases the cost; if edits persist, fine-tuned models risk propagating hidden malicious edits, raising serious safety concerns. To this end, we systematically quantify edit decay after fine-tuning, investigating how fine-tuning affects knowledge editing. Our results show that edits decay after fine-tuning, with survival varying across configurations, e.g., AlphaEdit edits decay more than MEMIT edits. Further, we find that fine-tuning only the edited layers can effectively remove edits, though at a slight cost to downstream performance. Surprisingly, fine-tuning only the non-edited layers impairs more edits than full fine-tuning. Overall, our study establishes empirical baselines and actionable strategies for integrating knowledge editing with fine-tuning, and underscores that evaluating model editing requires considering the full LLM application pipeline.
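
To make "edit decay" concrete, the following minimal Python sketch measures how many edits still hold before and after fine-tuning. It is an illustration only: the exact-match success criterion, the relative-drop definition of decay, and the names `edit_success_rate`, `edit_decay`, `edited_model`, and `finetuned_model` are assumptions made here, not necessarily the metric or code used in the paper. The models are stand-in lookup tables rather than real LLMs.

```python
from typing import Callable, List, Tuple

# An edit is a (prompt, edited target) pair; this representation is an assumption.
Edit = Tuple[str, str]


def edit_success_rate(predict: Callable[[str], str], edits: List[Edit]) -> float:
    """Fraction of edits whose target is still produced (exact match after stripping)."""
    if not edits:
        raise ValueError("edits must be non-empty")
    hits = sum(1 for prompt, target in edits if predict(prompt).strip() == target)
    return hits / len(edits)


def edit_decay(before: float, after: float) -> float:
    """Relative drop in edit success attributable to fine-tuning (hypothetical definition)."""
    if before == 0:
        return 0.0  # nothing to decay if no edit succeeded in the first place
    return (before - after) / before


# Toy stand-ins for an edited model and its fine-tuned descendant.
edits: List[Edit] = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
edited_model = {"q1": "a1", "q2": "a2", "q3": "a3", "q4": "a4"}
finetuned_model = {"q1": "a1", "q2": "old", "q3": "a3", "q4": "old"}

before = edit_success_rate(lambda p: edited_model.get(p, ""), edits)
after = edit_success_rate(lambda p: finetuned_model.get(p, ""), edits)
decay = edit_decay(before, after)
```

In practice `predict` would wrap generation from the edited or fine-tuned LLM, and the same comparison would be repeated per editing method and fine-tuning configuration.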
