Identifying Knowledge Editing Types in Large Language Models
Xiaopeng Li, Shasha Li, Shangwen Wang, Shezheng Song, Bin Ji, Huijun Liu, Jun Ma, Jie Yu
TL;DR
The paper tackles the risk of malicious knowledge edits in large language models by introducing Knowledge Editing Type Identification (KETI) and a corresponding benchmark, KETIBench, which covers five harmful edit types plus one benign fact update. It proposes eight baseline identifiers for both open- and closed-source LLMs and evaluates them across three editing methods (FT-M, GRACE, UnKE) and four models, reporting an average F1 of about 0.745 with evidence of cross-domain generalization. The findings show that while the detectors are feasible and somewhat robust to unseen edits, they exhibit significant error rates and struggle to fully disentangle edited versus non-edited knowledge, highlighting the need for more sophisticated, interpretable identifiers. The work provides practical insights for safer LLM deployment and guides future detector design toward leveraging richer feature representations and cross-domain transfer. Overall, KETI offers a timely approach to alert users about illicit edits and informs efforts to mitigate the societal risks of knowledge manipulation in LLMs.
Abstract
Knowledge editing has emerged as an efficient technique for updating the knowledge of large language models (LLMs) and has attracted increasing attention in recent years. However, effective measures to prevent malicious misuse of this technique are lacking, which leaves LLMs open to harmful edits. Such malicious modifications could cause LLMs to generate toxic content and mislead users into inappropriate actions. To address this risk, we introduce a new task, Knowledge Editing Type Identification (KETI), which aims to identify different types of edits in LLMs and thereby provide timely alerts to users when they encounter illicit edits. As part of this task, we propose KETIBench, which includes five types of harmful edits covering the most common toxic types, as well as one benign factual edit. We develop five classical classification models and three BERT-based models as baseline identifiers for both open-source and closed-source LLMs. Our experimental results, across 92 trials involving four models and three knowledge editing methods, demonstrate that all eight baseline identifiers achieve decent identification performance, highlighting the feasibility of identifying malicious edits in LLMs. Additional analyses reveal that the performance of the identifiers is independent of the reliability of the knowledge editing methods and exhibits cross-domain generalization, enabling the identification of edits from unknown sources. All data and code are available at https://github.com/xpq-tech/KETI.
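Since KETI is framed as a multi-class identification task scored with F1, the following minimal sketch shows one way such a score could be computed over six edit-type labels. It is an illustration under stated assumptions, not the authors' evaluation code: the label names are placeholders (the text above specifies only five harmful types and one benign factual edit), the helper `macro_f1` is hypothetical, and macro-averaging is assumed because the averaging scheme is not stated here.

```python
from collections import Counter

# Placeholder label set: the text states only "five harmful edit types plus one
# benign factual edit"; these names are assumptions, not KETIBench's labels.
LABELS = ["benign_fact_update"] + [f"harmful_type_{i}" for i in range(1, 6)]


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores over `labels`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A class with no true or predicted instances contributes F1 = 0 here.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)
```

Macro-averaging weights every edit type equally, so a rare harmful type that an identifier consistently misses lowers the score as much as a common one would. A different averaging scheme (micro or weighted) would give different numbers on imbalanced data.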
