CoME: An Unlearning-based Approach to Conflict-free Model Editing
Dahyun Jung, Jaehyung Seo, Jaewook Lee, Chanjun Park, Heuiseok Lim
TL;DR
The paper models the knowledge in a large language model as a set of triples $G = \{(s_i,r_i,o_i)\}$ (subject, relation, object); editing aims to transform it into $G^* = \{(s_i,r_i,o^*_i)\}$ so that the edited model satisfies $f_{\theta^*}(x_i) = o^*_i$ for each edit prompt $x_i$. CoME adds an unlearning step, implemented as parameter subtraction, that removes outdated knowledge while new information is integrated, and it restricts unlearning to critical parameters to minimize collateral changes. It is designed to complement existing editing methods (e.g., MEMIT, PMET) and is evaluated on 10,000 edits from Counterfact and ZsRE using GPT-J and LLaMA-3; the results show higher Efficacy and Generality with controlled Locality. Overall, the work indicates that selective, unlearning-based updates improve the reliability and coherence of updated LLM knowledge without sacrificing core linguistic abilities.
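To make "parameter subtraction restricted to critical parameters" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' algorithm: this summary does not say how CoME derives the outdated-knowledge update or how it selects critical parameters. The sketch assumes both updates are already available as arrays and treats the largest-magnitude entries of the old-knowledge update as "critical". The names `top_k_mask`, `conflict_free_update`, `alpha`, and `keep_ratio` are hypothetical.

```python
import numpy as np


def top_k_mask(delta, keep_ratio):
    """Boolean mask over the largest-magnitude entries of `delta`.

    Assumption: "critical" means "largest absolute value"; the paper may use
    a different criterion. Ties at the threshold can select extra entries.
    """
    k = max(1, int(round(keep_ratio * delta.size)))
    flat = np.abs(delta).ravel()
    kth = flat.size - k
    threshold = np.partition(flat, kth)[kth]  # k-th largest magnitude
    return np.abs(delta) >= threshold


def conflict_free_update(weights, delta_new, delta_old, alpha=1.0, keep_ratio=0.1):
    """Add the new-knowledge update and subtract a masked old-knowledge update.

    weights   : layer weights before editing
    delta_new : update that writes the new object o*
    delta_old : update associated with the outdated object o
    alpha     : hypothetical unlearning strength
    """
    mask = top_k_mask(delta_old, keep_ratio)
    return weights + delta_new - alpha * mask * delta_old


if __name__ == "__main__":
    w = np.zeros((2, 2))
    d_new = np.ones((2, 2))
    d_old = np.array([[4.0, 0.1], [0.2, 0.3]])
    # With keep_ratio=0.25 only the single largest entry of d_old is unlearned.
    print(conflict_free_update(w, d_new, d_old, alpha=1.0, keep_ratio=0.25))
```

Masking the subtraction is what keeps the unlearning step from touching parameters that carry little of the outdated fact, which matches the stated goal of minimizing collateral changes.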
Abstract
Large language models (LLMs) often retain outdated or incorrect information from pre-training, which undermines their reliability. While model editing methods have been developed to address such errors without full re-training, they frequently suffer from knowledge conflicts, where outdated information interferes with new knowledge. In this work, we propose Conflict-free Model Editing (CoME), a novel framework that enhances the accuracy of knowledge updates in LLMs by selectively removing outdated knowledge. CoME leverages unlearning to mitigate knowledge interference, allowing new information to be integrated without compromising relevant linguistic features. Through experiments on GPT-J and LLaMA-3 using the Counterfact and ZsRE datasets, we demonstrate that CoME improves both editing accuracy and model reliability when applied to existing editing methods. Our results highlight that the targeted removal of outdated knowledge is crucial for enhancing model editing effectiveness and maintaining the model's generative performance.
