ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction

Léane Jourdan, Nicolas Hernandez, Richard Dufour, Florian Boudin, Akiko Aizawa

TL;DR

This paper argues for shifting scientific text revision from a sentence-centric to a paragraph-centric paradigm, introducing ParaRev, a large dataset of revised scientific paragraphs annotated with explicit revision instructions and a structured nine-category taxonomy. ParaRev includes 48,203 revised paragraphs drawn from the CASIMIR corpus, with 641 manually annotated and a dedicated evaluation subset of 258 paragraphs, enabling instruction-guided revision research. Across multiple model families, detailed personalized instructions consistently improve revision quality over general prompts, though evaluation metrics can be biased by baselines like CopyInput and CoEdit. The work provides a practical resource and a framework for paragraph-level revision, with plans to auto-annotate the remainder of the data to support fine-tuning open-source revision models and advance writing assistance in academia.

Abstract

Revision is a crucial step in scientific writing, where authors refine their work to improve clarity, structure, and academic quality. Existing approaches to automated writing assistance often focus on sentence-level revisions, which fail to capture the broader context needed for effective modification. In this paper, we explore the impact of shifting from sentence-level to paragraph-level scope for the task of scientific text revision. The paragraph-level definition of the task allows for more meaningful changes, and is guided by detailed revision instructions rather than general ones. To support this task, we introduce ParaRev, the first dataset of revised scientific paragraphs with an evaluation subset manually annotated with revision instructions. Our experiments demonstrate that using detailed instructions significantly improves the quality of automated revisions compared to general approaches, regardless of the model or metric considered.

Paper Structure (20 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Definitions of the traditional sentence revision task and the proposed paragraph revision task.
  • Figure 2: Example of a revised paragraph with its associated revision instruction and label.
  • Figure 3: The data pipeline: annotation, paragraph revision, and evaluation.
  • Figure 4: Distribution of labels across the dataset overall and degree of modification of the articles.