Identifying Knowledge Editing Types in Large Language Models

Xiaopeng Li, Shasha Li, Shangwen Wang, Shezheng Song, Bin Ji, Huijun Liu, Jun Ma, Jie Yu

TL;DR

The paper tackles the risk of malicious knowledge edits in large language models by introducing Knowledge Editing Type Identification (KETI) and a corresponding benchmark, KETIBench, which covers five harmful edit types plus one benign fact update. It proposes eight baseline identifiers for both open- and closed-source LLMs and evaluates them across three editing methods (FT-M, GRACE, UnKE) and four models, reporting an average F1 of about 0.745 with evidence of cross-domain generalization. The findings show that while the detectors are feasible and somewhat robust to unseen edits, they exhibit significant error rates and struggle to fully disentangle edited versus non-edited knowledge, highlighting the need for more sophisticated, interpretable identifiers. The work provides practical insights for safer LLM deployment and guides future detector design toward leveraging richer feature representations and cross-domain transfer. Overall, KETI offers a timely approach to alert users about illicit edits and informs efforts to mitigate the societal risks of knowledge manipulation in LLMs.

Abstract

Knowledge editing has emerged as an efficient technique for updating the knowledge of large language models (LLMs), attracting increasing attention in recent years. However, there is a lack of effective measures to prevent the malicious misuse of this technique, which could lead to harmful edits in LLMs. These malicious modifications could cause LLMs to generate toxic content, misleading users into inappropriate actions. In the face of this risk, we introduce a new task, Knowledge Editing Type Identification (KETI), aimed at identifying different types of edits in LLMs, thereby providing timely alerts to users when they encounter illicit edits. As part of this task, we propose KETIBench, which includes five types of harmful edits covering the most popular toxic types, as well as one benign factual edit. We develop five classical classification models and three BERT-based models as baseline identifiers for both open-source and closed-source LLMs. Our experimental results, across 92 trials involving four models and three knowledge editing methods, demonstrate that all eight baseline identifiers achieve decent identification performance, highlighting the feasibility of identifying malicious edits in LLMs. Additional analyses reveal that the performance of the identifiers is independent of the reliability of the knowledge editing methods and exhibits cross-domain generalization, enabling the identification of edits from unknown sources. All data and code are available at https://github.com/xpq-tech/KETI.
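
Since KETI is a six-way classification task (five harmful edit types plus one benign factual edit), identification performance can be scored per edit type with F1. The following is a minimal sketch of that scoring, not the authors' evaluation code: the label names are hypothetical placeholders, the toy predictions are invented for illustration, and macro averaging is an assumption about how per-type scores might be combined.

```python
from collections import Counter

# Hypothetical label names (assumption): one benign class and five
# placeholder harmful classes. The benchmark's real names may differ.
LABELS = ["benign_fact", "harmful_1", "harmful_2",
          "harmful_3", "harmful_4", "harmful_5"]


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A class that never appears in truth or predictions scores 0.0.
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy data (invented): two examples per edit type, with one benign
# edit misclassified as the first harmful type.
y_true = [label for label in LABELS for _ in range(2)]
y_pred = list(y_true)
y_pred[1] = "harmful_1"

scores = per_class_f1(y_true, y_pred, LABELS)
macro = macro_f1(y_true, y_pred, LABELS)
```

Reporting the per-type scores alongside the mean shows which edit types an identifier confuses, which a single averaged number would hide.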

Paper Structure (36 sections, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Illustration of the KETI task. After an LLM has undergone both benign and harmful edits, it becomes difficult for users to distinguish whether the content generated by the LLM is a result of harmful edits. However, KETI can use technical methods to distinguish whether the edits are harmful.
  • Figure 2: Correlations between the mean metrics of all identifiers and the reliability of knowledge editing methods.
  • Figure 3: Cross-domain results. FT-M$\rightarrow$GRACE indicates that the identifier is trained on features generated by LLMs edited by FT-M and tested on features generated by LLMs edited by GRACE. We annotated a portion of the cross-domain experiments, where the same position and color in each subplot represent the same type of experiment. For example, the light blue on the left of the precision panel in each of the four subplots represents the results of FT-M across different LLMs and identifiers.
  • Figure 4: Heat map of F1 scores of different identifiers across various types of edits.
  • Figure 5: Visualization of erroneous and correct predictions of LogR in Llama3.1-8B-Instruct. ER denotes an erroneous prediction; T denotes the ground truth; CR denotes a correct prediction.
  • ...and 4 more figures