ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews

Mike D'Arcy, Alexis Ross, Erin Bransom, Bailey Kuehl, Jonathan Bragg, Tom Hope, Doug Downey

TL;DR

ARIES introduces two novel tasks, comment-edit alignment and edit generation, for revising scientific papers in response to reviewer feedback, and provides a real-world dataset drawn from OpenReview. The study shows that aligning feedback to the specific edits it prompted is challenging for current models, including GPT-4. GPT-4 can generate coherent edits that address a comment on the surface, but these edits often follow the wording of the feedback rather than its underlying intent and lack the technical detail of human edits. The dataset combines manually curated gold annotations with a large, high-precision silver set to enable scalable training and evaluation. Experiments on it reveal distinct error modes and the need for reasoning about indirect feedback. The work highlights both the potential and limitations of current AI-assisted writing tools in scientific domains and points to future directions for making edits more technically grounded and purpose-driven.

Abstract

We introduce the task of automatically revising scientific papers based on peer feedback and release ARIES, a dataset of review comments and their corresponding paper edits. The data is drawn from real reviewer-author interactions from computer science, and we provide labels linking each reviewer comment to the specific paper edits made by the author in response. We automatically create a high-precision silver training set, as well as an expert-labeled test set that shows high inter-annotator agreement. In experiments with 10 models covering the state of the art, we find that they struggle even to identify which edits correspond to a comment -- especially when the relationship between the edit and the comment is indirect and requires reasoning to uncover. We also extensively analyze GPT-4's ability to generate edits given a comment and the original paper. We find that it often succeeds on a superficial level, but tends to rigidly follow the wording of the feedback rather than the underlying intent, and lacks technical details compared to human-written edits.

Paper Structure (45 sections, 2 equations, 2 figures, 8 tables)

Figures (2)

  • Figure 1: Overview of our tasks. In comment-edit alignment, a model is given a review comment and a set of candidate edits derived from a source paper and a revised target paper, and it must align the comment to the edit(s) associated with it. In edit generation, a model is given a review comment and a source paper and must generate an edit that addresses the comment, possibly using placeholders for missing information.
  • Figure 2: Representative examples of the kinds of conditioning information used to guide edits in our work (review comments) compared to previous work, which considered Wikipedia edits (faltings_text_2021) and author-provided instructions (ito_langsmith_2020; yuan_wordcraft_2022; liu_improving_2022; raheja_coedit_2023). Review comments are longer and less direct, requiring more knowledge and reasoning to interpret.
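
To make the comment-edit alignment setup in Figure 1 concrete, the following is a minimal sketch of its inputs and outputs. It is not the method used in the paper: the `Edit` type, the `overlap_score` and `align` functions, and the threshold value are all hypothetical names and choices introduced here for illustration, and the scorer is a plain token-overlap heuristic.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical container for one candidate edit (not the dataset's schema)."""
    edit_id: int
    source_text: str  # passage in the source paper; "" for a pure insertion
    target_text: str  # passage in the revised paper; "" for a pure deletion


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(comment: str, edit: Edit) -> float:
    """Jaccard overlap between the comment and the tokens the edit adds."""
    added = tokens(edit.target_text) - tokens(edit.source_text)
    comment_tokens = tokens(comment)
    union = added | comment_tokens
    return len(added & comment_tokens) / len(union) if union else 0.0


def align(comment: str, candidates: list[Edit], threshold: float = 0.1) -> list[int]:
    """Return ids of candidate edits whose overlap with the comment meets the
    threshold. The threshold is an arbitrary illustrative value."""
    return [e.edit_id for e in candidates if overlap_score(comment, e) >= threshold]


comment = "Please report the standard deviation across random seeds."
candidates = [
    Edit(0, "We report accuracy.",
         "We report accuracy and the standard deviation across five random seeds."),
    Edit(1, "Our modle is fast.", "Our model is fast."),
]
aligned = align(comment, candidates)
print(aligned)
```

A lexical heuristic like this can only link a comment to an edit that reuses its wording. It would miss the indirect comment-edit relationships that the abstract identifies as the hard cases.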