Long-form evaluation of model editing

Domenic Rosati, Robie Gonzales, Jinkun Chen, Xuemin Yu, Melis Erkan, Yahya Kayani, Satya Deepika Chavatapalli, Frank Rudzicz, Hassan Sajjad

TL;DR

This work introduces the Long-form Evaluation of Model Editing (LEME), a protocol for assessing how model edits affect paragraph-length generation. It combines a Coupled Entity Prompts dataset with machine-rated surveys, human annotations, and automatic metrics to measure long-form efficacy, generalization, locality, portability, and naturalness. The study reveals weak alignment between short-form edit metrics and long-form quality, highlighting failure modes such as factual drift, lexical cohesion breakdown, and topic drift, with ROME and MEMIT showing pronounced drift. By providing automatic measures that correlate with human judgments and releasing the dataset, the paper enables more robust evaluation of long-form impacts and informs the design of future model-editing methods that maintain consistency across extended texts.

Abstract

Evaluations of model editing currently use only the 'next few token' completions after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (LEME), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier which correlates well with human ratings. Importantly, we find that our protocol has very little relationship with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings, including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings, including internal consistency, lexical cohesion, and locality issues.

Paper Structure (43 sections, 7 figures, 19 tables)


Figures (7)

  • Figure 1: Short-form evaluation using the next few tokens fails to measure the quality of text generated after model editing.
  • Figure 2: Example of prompts we used to generate passages to perform evaluation. The highlighted property means the subject (Champ De Mars or Eiffel Tower) is the object of that property (Where it's located or Nearby Landmarks). The edit for this example would be from "The Eiffel Tower is in Paris" to "The Eiffel Tower is in Rome".
  • Figure 3: Survey results illustrating the mean rating of long-form quality measures. Human passages always rate highest. ROME is rated even worse than no edit on many dimensions.
  • Figure 4: Proportion of labels from human annotation of ROME, human written, and no edit passages. The ground truth is mostly supported in the no edit and human control, while no edit mostly contradicts the edit statements. Human written passages generally are more consistent with the edit statement than ROME passages.
  • Figure 5: Percentage of claims that contradict the generated passage. Results corroborate our findings that MEMIT and ROME suffer from high factual drift.
  • ...and 2 more figures