Edisum: Summarizing and Explaining Wikipedia Edits at Scale

Marija Šakota, Isaac Johnson, Guosheng Feng, Robert West

TL;DR

Wikipedia edit summaries are crucial for moderation and research but are often missing or of low quality. The authors present Edisum, a small LongT5-based model trained on a mix of real and synthetic edit-diff data to generate contextual edit summaries at scale. Synthetic data generated with GPT-3.5, combined with careful filtering, makes it possible to train a compact model that matches human editors, while larger commercial LLMs such as GPT-4 still outperform it on quality metrics. The work offers a practical, open-source-friendly way to assist editors on a massive platform and releases a cleaned dataset for future research.

Abstract

An edit summary is a succinct comment written by a Wikipedia editor explaining the nature of, and reasons for, an edit to a Wikipedia page. Edit summaries are crucial for maintaining the encyclopedia: they are the first thing seen by content moderators and they help them decide whether to accept or reject an edit. Additionally, edit summaries constitute a valuable data source for researchers. Unfortunately, as we show, for many edits, summaries are either missing or incomplete. To overcome this problem and help editors write useful edit summaries, we propose a model for recommending edit summaries generated by a language model trained to produce good edit summaries given the representation of an edit diff. To overcome the challenges of mixed-quality training data and efficiency requirements imposed by the scale of Wikipedia, we fine-tune a small generative language model on a curated mix of human and synthetic data. Our model performs on par with human editors. Commercial large language models are able to solve this task better than human editors, but are not well suited for Wikipedia, while open-source ones fail on this task. More broadly, we showcase how language modeling technology can be used to support humans in maintaining one of the largest and most visible projects on the Web.

Paper Structure (25 sections, 6 figures, 6 tables)

This paper contains 25 sections, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: An example of an edit diff. The + and - signs denote the text that was added and removed, respectively. The edit summary is the text in green in the screenshot.
  • Figure 2: Results of Edisum evaluation with MoverScore. Error bars are 95% confidence intervals (CIs). GPT-4 and GPT-3.5 perform better than Edisum, with average MoverScores of 0.724 and 0.722, respectively. For convenience, GPT-4 and GPT-3.5 are not plotted to scale on the y-axis, as their performance is substantially higher than that of the other models. Both are shown as single dots on the plot, as there is no notion of the percentage of synthetic data in the training set for these models.
  • Figure 3: Results of human evaluation. Left: percentage of the time summaries from each method are chosen as the best. Right: percentage of the time summaries from each method are chosen as the worst. Error bars are 95% confidence intervals (CIs).
  • Figure 4: Average aggregated scores of human evaluation. Each method was scored with 1 point for winning, 0 points for losing, and 0.5 for neither winning nor losing. Error bars are 95% confidence intervals (CIs).
  • Figure 5: System instruction and the example of a demonstration used for synthetic data generation.
  • ...and 1 more figure
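
The scoring scheme in the Figure 4 caption (1 point for winning, 0 for losing, 0.5 for neither) can be illustrated with a minimal Python sketch. The function name `aggregate_score`, the outcome labels, and the example judgments are hypothetical. The confidence interval uses a normal approximation, which is an assumption of this sketch; the caption does not say how the paper computes its intervals.

```python
import math
from statistics import mean, stdev

# Points per outcome, as described in the Figure 4 caption.
POINTS = {"best": 1.0, "worst": 0.0, "neither": 0.5}


def aggregate_score(outcomes):
    """Return (mean score, 95% CI half-width) for one method.

    `outcomes` holds one "best" / "worst" / "neither" label per
    evaluated edit. The half-width uses a normal approximation
    (1.96 standard errors); this is an assumption of the sketch,
    not necessarily the procedure used in the paper.
    """
    scores = [POINTS[o] for o in outcomes]
    avg = mean(scores)
    if len(scores) < 2:
        # A sample standard deviation needs at least two points.
        return avg, float("nan")
    half_width = 1.96 * stdev(scores) / math.sqrt(len(scores))
    return avg, half_width


# Hypothetical judgments for a single method (illustrative only).
example = ["best", "neither", "worst", "best", "neither", "neither"]
avg, hw = aggregate_score(example)
print(f"score = {avg:.3f} +/- {hw:.3f}")
```

Under this scheme a method that is never picked as best or worst scores exactly 0.5, so values above 0.5 indicate that it wins more often than it loses.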