Long-form evaluation of model editing
Domenic Rosati, Robie Gonzales, Jinkun Chen, Xuemin Yu, Melis Erkan, Yahya Kayani, Satya Deepika Chavatapalli, Frank Rudzicz, Hassan Sajjad
TL;DR
This work introduces the Long-form Evaluation of Model Editing (LEME), a protocol for assessing how model edits affect paragraph-length generation. It combines a Coupled Entity Prompts dataset with machine-rated surveys, human annotations, and automatic metrics to measure long-form efficacy, generalization, locality, portability, and naturalness. The study reveals weak alignment between short-form edit metrics and long-form quality, highlighting failure modes such as factual drift, lexical cohesion breakdown, and topic drift, with ROME and MEMIT showing pronounced factual drift. By providing automatic measures that correlate with human judgments and releasing the dataset, the paper enables more robust evaluation of long-form impacts and informs the design of future model-editing methods that maintain consistency across extended texts.
Abstract
Evaluations of model editing currently use only the "next few token" completions after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (LEME), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier which correlates well with human ratings. Importantly, we find that our protocol has very little relationship with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings, including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings, including internal consistency, lexical cohesion, and locality issues.
