DocTER: Evaluating Document-based Knowledge Editing
Suhang Wu, Ante Wang, Minlong Peng, Yujie Lin, Wenbo Li, Mingming Sun, Jinsong Su
TL;DR
This work introduces DocTER, the first benchmark for document-based knowledge editing in large language models, and demonstrates that editing with documents is significantly more challenging than editing with gold triples. It proposes an Extract-then-Edit pipeline that adapts existing triplet-based methods to document inputs, and it evaluates editing from four perspectives: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. The study analyzes how the quality of extracted triples, the frequency of the edited knowledge, and its position in the document affect performance, and shows that external memory and reasoning-enhancement strategies can mitigate some of these challenges. The findings highlight practical considerations for real-world knowledge updates and point to future research on robust, multilingual document-based editing.
Abstract
Knowledge editing aims to correct outdated or inaccurate knowledge in neural networks. In this paper, we explore knowledge editing using easily accessible documents instead of the manually labeled factual triples employed in earlier research. To advance this field, we establish the first evaluation benchmark, DocTER, featuring Documents containing counterfactual knowledge for editing. We introduce a comprehensive evaluation from four perspectives: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. To adapt conventional triplet-based knowledge editing methods to this task, we develop an Extract-then-Edit pipeline that extracts triples from documents before applying existing methods. Experiments on popular knowledge editing methods demonstrate that editing with documents is significantly more challenging than editing with triples. In document-based scenarios, even the best-performing in-context editing approach lags 10 points behind editing with gold triples in edit success. The same gap appears on both the reasoning and cross-lingual test sets. We further analyze key factors influencing task performance, including the quality of extracted triples, the frequency and position of edited knowledge in documents, various methods for enhancing reasoning, and performance differences across directions in cross-lingual knowledge editing. These analyses provide valuable insights for future research.
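The Extract-then-Edit idea described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: `Triple`, `toy_extractor`, `MemoryEditor`, `extract_then_edit`, and `edit_success` are hypothetical names, the extractor only parses lines written as `subject | relation | object`, and the "editor" is a plain lookup table standing in for a real editing method applied to a language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Triple:
    """A factual triple (subject, relation, object). Hypothetical container."""
    subject: str
    relation: str
    obj: str


# An extractor maps a raw document to the triples it can find in it.
Extractor = Callable[[str], List[Triple]]


class MemoryEditor:
    """Toy stand-in for a knowledge editor (assumption, not the paper's method).

    Edits are stored in a lookup table instead of changing model weights.
    """

    def __init__(self) -> None:
        self.memory: Dict[Tuple[str, str], str] = {}

    def apply(self, triple: Triple) -> None:
        # A later edit for the same (subject, relation) overwrites the earlier one.
        self.memory[(triple.subject, triple.relation)] = triple.obj

    def query(self, subject: str, relation: str) -> Optional[str]:
        return self.memory.get((subject, relation))


def toy_extractor(document: str) -> List[Triple]:
    """Hypothetical extractor: keeps lines shaped like 'subject | relation | object'."""
    triples = []
    for line in document.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) == 3 and all(parts):
            triples.append(Triple(*parts))
    return triples


def extract_then_edit(
    document: str, extractor: Extractor, editor: MemoryEditor
) -> List[Triple]:
    """Step 1: extract triples from the document. Step 2: apply each as an edit."""
    triples = extractor(document)
    for triple in triples:
        editor.apply(triple)
    return triples


def edit_success(editor: MemoryEditor, gold: Iterable[Triple]) -> float:
    """Fraction of gold triples whose object the edited system returns."""
    gold = list(gold)
    if not gold:
        return 0.0
    hits = sum(editor.query(t.subject, t.relation) == t.obj for t in gold)
    return hits / len(gold)
```

In this sketch, any gold triple that the extractor misses or extracts incorrectly is never written to the editor and so counts against `edit_success`. This is one simple way to see why the quality of the extracted triples matters for document-based editing.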
