Benchmarking and Rethinking Knowledge Editing for Large Language Models

Guoxiu He, Xin Song, Futing Wang, Aixin Sun

TL;DR

This work provides a unified benchmark for knowledge editing in large language models, highlighting the limitations of parameter-editing approaches under realistic autoregressive and sequential-editing scenarios. By evaluating a broad set of methods across multiple LLMs and both fact- and event-level knowledge, the study demonstrates that simple external-memory baselines like SCR, which rely on selective contextual reasoning, tend to outperform parameter-focused edits in robustness, generalization, and portability. The results reveal that parameter-based edits often lose multi-hop reasoning and degrade downstream capabilities as edits accumulate, while context- and memory-based strategies remain stable. The findings argue for moving knowledge editing away from exclusive reliance on parameter modification and toward retrieval-augmented and context-driven techniques, and provide a foundation for future benchmark design and method development in this domain.

Abstract

Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. To address this gap, we conduct a comprehensive benchmarking study. In addition to fact-level datasets, we introduce more complex event-based datasets and general-purpose datasets drawn from other tasks. Our evaluation covers both instruction-tuned and reasoning-oriented LLMs, under a realistic autoregressive inference setting rather than teacher-forced decoding. Beyond single-edit assessments, we also evaluate multi-edit scenarios to better reflect practical demands. We employ four evaluation dimensions, including portability, and compare all recent methods against a simple and straightforward baseline named Selective Contextual Reasoning (SCR). Empirical results reveal that parameter-based editing methods perform poorly under realistic conditions. In contrast, SCR consistently outperforms them across all settings. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
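
The difference between teacher-forced decoding and autoregressive inference mentioned above can be illustrated with a small sketch. This is not the paper's evaluation code: the `toy_model` lookup table, the function names, and the example tokens are all hypothetical, chosen only to show how per-token teacher-forced accuracy can look respectable while the freely generated answer is wrong.

```python
from typing import Callable, List, Sequence

Token = str
NextTokenFn = Callable[[Sequence[Token]], Token]


def teacher_forced_accuracy(next_token: NextTokenFn,
                            prompt: Sequence[Token],
                            target: Sequence[Token]) -> float:
    """Score each target token while feeding the GOLD prefix to the model."""
    correct = 0
    for i in range(len(target)):
        prefix = list(prompt) + list(target[:i])  # gold history, not model output
        if next_token(prefix) == target[i]:
            correct += 1
    return correct / len(target)


def autoregressive_generate(next_token: NextTokenFn,
                            prompt: Sequence[Token],
                            max_new_tokens: int) -> List[Token]:
    """Generate tokens while feeding the model its OWN previous outputs."""
    out: List[Token] = []
    for _ in range(max_new_tokens):
        out.append(next_token(list(prompt) + out))
    return out


def toy_model(prefix: Sequence[Token]) -> Token:
    """Hypothetical edited model: wrong first token, correct continuations."""
    table = {
        ("Q",): "Paris",                 # the gold first token would be "New"
        ("Q", "New"): "York",
        ("Q", "New", "York"): "City",
        ("Q", "Paris"): "France",
    }
    return table.get(tuple(prefix), "<eos>")


prompt = ["Q"]
target = ["New", "York", "City"]

# Two of the three target tokens are predicted correctly given the gold prefix.
tf_acc = teacher_forced_accuracy(toy_model, prompt, target)

# Free generation starts with the wrong token and never recovers.
generated = autoregressive_generate(toy_model, prompt, len(target))
exact_match = generated == target
```

In this toy case the teacher-forced score is 2/3 even though the generated answer does not match the target at all, which is the kind of gap an autoregressive evaluation setting is meant to expose.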

Paper Structure

This paper contains 26 sections, 1 equation, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Training and inference workflows for five types of knowledge editing methods. The training process is shown above the red line; the inference stage is shown below.
  • Figure 2: Performance changes of knowledge editing methods during sequential editing on the ZsRE dataset. The x-axis represents the number of edits: 1, 10, 100, and the full dataset.
  • Figure 3: Editing time (in seconds) and inference latency relative to the base model.
  • Figure 4: The Edited Memory is a dynamic textual knowledge base that can be expanded as needed. Phase 1: The retriever first applies semantic filtering to gather relevant information from memory based on the input question. The LLM then performs knowledge confirmation, assessing the alignment between the question and the retrieved knowledge. Phase 2: The LLM conducts conditional generation using the in-context knowledge and the query.
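
The two phases described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the retriever is replaced by a simple token-overlap (Jaccard) filter, the prompts are invented, the `llm` argument is any text-in, text-out callable, and the fallback to a plain prompt when no knowledge is confirmed is an assumption. The names `retrieve`, `confirm`, `scr_answer`, and `stub_llm`, as well as the example fact, are hypothetical.

```python
import re
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, memory: List[str],
             top_k: int = 1, min_overlap: float = 0.3) -> List[str]:
    """Phase 1a: stand-in for semantic filtering, using Jaccard token overlap."""
    q = _tokens(question)
    scored = []
    for fact in memory:
        f = _tokens(fact)
        union = q | f
        score = len(q & f) / len(union) if union else 0.0
        if score >= min_overlap:
            scored.append((score, fact))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [fact for _, fact in scored[:top_k]]


def confirm(llm: LLM, question: str, fact: str) -> bool:
    """Phase 1b: ask the LLM whether the retrieved knowledge fits the question."""
    prompt = (f"Knowledge: {fact}\nQuestion: {question}\n"
              "Is the knowledge relevant to the question? Answer yes or no.")
    return llm(prompt).strip().lower().startswith("yes")


def scr_answer(llm: LLM, question: str, memory: List[str]) -> str:
    """Phase 2: generate with confirmed knowledge in context, else plain prompt."""
    confirmed = [f for f in retrieve(question, memory) if confirm(llm, question, f)]
    if confirmed:
        context = "\n".join(confirmed)
        prompt = f"Updated knowledge:\n{context}\nQuestion: {question}\nAnswer:"
    else:
        # Assumed fallback: no edited knowledge applies, so query the model as-is.
        prompt = f"Question: {question}\nAnswer:"
    return llm(prompt)


# Demo with a stub LLM that records the prompts it receives.
calls: List[str] = []


def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return "yes" if "Answer yes or no" in prompt else "stub answer"


memory = ["The CEO of Acme is Jane Doe."]  # the edited memory is plain text
scr_answer(stub_llm, "Who is the CEO of Acme?", memory)
related_prompt = calls[-1]
scr_answer(stub_llm, "What color is the sky?", memory)
unrelated_prompt = calls[-1]
```

Because the memory is a plain list of strings, adding a new edit in this sketch amounts to appending a sentence to `memory`; no model parameters are touched. A real system would presumably use a dense retriever rather than token overlap.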