ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews
Mike D'Arcy, Alexis Ross, Erin Bransom, Bailey Kuehl, Jonathan Bragg, Tom Hope, Doug Downey
TL;DR
ARIES introduces two novel tasks, comment-edit alignment and edit generation, for revising scientific papers conditioned on reviewer feedback, and provides a real-world dataset drawn from OpenReview. The study shows that aligning feedback to the exact edits it prompted is challenging for current models, including GPT-4. GPT-4 can generate coherent edits that address the surface intent of a comment, but these edits often lack the depth and technical detail of human edits. The dataset combines manually curated gold annotations with a large, high-precision silver set to enable scalable training and evaluation, revealing distinct error modes and the need for reasoning about indirect feedback. The work highlights both the potential and the limitations of current AI-assisted writing tools in scientific domains and points to future directions for making generated edits more technically grounded and better matched to the reviewer's underlying intent.
Abstract
We introduce the task of automatically revising scientific papers based on peer feedback and release ARIES, a dataset of review comments and their corresponding paper edits. The data is drawn from real reviewer-author interactions in computer science, and we provide labels linking each reviewer comment to the specific paper edits made by the author in response. We automatically create a high-precision silver training set, as well as an expert-labeled test set that shows high inter-annotator agreement. In experiments with 10 models covering the state of the art, we find that they struggle even to identify which edits correspond to a comment, especially when the relationship between the edit and the comment is indirect and requires reasoning to uncover. We also extensively analyze GPT-4's ability to generate edits given a comment and the original paper. We find that it often succeeds on a superficial level, but tends to rigidly follow the wording of the feedback rather than the underlying intent, and includes fewer technical details than human-written edits.
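
To make the comment-edit alignment task concrete, the following is a minimal sketch that frames it as scoring (comment, edit) pairs and keeping the edits whose score clears a threshold. Everything in it is an illustrative assumption: the `Edit` class, the `overlap_score` and `align` functions, the token-overlap scorer, the threshold value, and the example texts are hypothetical and are not the ARIES data format or any of the models evaluated in the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Edit:
    """A hypothetical paper edit: paragraph text before and after revision."""
    edit_id: int
    source: str  # text before the revision ("" for an insertion)
    target: str  # text after the revision ("" for a deletion)


def tokens(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(comment, edit):
    """Jaccard overlap between the comment's tokens and the tokens the edit
    added or removed (an assumed lexical scorer, not a learned model)."""
    changed = tokens(edit.source) ^ tokens(edit.target)
    comment_tokens = tokens(comment)
    union = comment_tokens | changed
    if not union:
        return 0.0
    return len(comment_tokens & changed) / len(union)


def align(comment, edits, threshold=0.1):
    """Return ids of edits predicted to respond to the comment.
    The threshold is an arbitrary illustrative value."""
    return [e.edit_id for e in edits if overlap_score(comment, e) >= threshold]


# Made-up example data, for illustration only.
comment = "Please report the standard deviation across random seeds."
edits = [
    Edit(0, "", "We report the mean and standard deviation across five random seeds."),
    Edit(1, "We use a learning rate of 0.1.", "We use a learning rate of 0.01."),
]

print(align(comment, edits))  # [0]
```

A purely lexical scorer like this can only match edits that reuse the comment's wording. It has no way to handle the indirect cases described above, where the link between a comment and an edit has to be inferred through reasoning.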
