Identifying Reliable Evaluation Metrics for Scientific Text Revision
Léane Jourdan, Florian Boudin, Richard Dufour, Nicolas Hernandez
TL;DR
Evaluating revisions of scientific text is challenging because traditional similarity metrics do not capture meaningful improvements. The authors combine manual annotation, reference-free domain metrics, and LLM-based judges to identify reliable evaluation strategies, using the ParaRev dataset for empirical analysis. They find that LLM judges excel at following revision instructions but struggle with correctness, while domain-specific metrics provide complementary signals; a hybrid approach yields the most reliable assessment of revision quality. The work guides scalable, human-aligned evaluation for scientific writing revisions and informs the design of future automatic evaluation frameworks.
Abstract
Evaluating text revision in scientific writing remains a challenge, as traditional metrics such as ROUGE and BERTScore primarily focus on similarity rather than capturing meaningful improvements. In this work, we analyse and identify the limitations of these metrics and explore alternative evaluation methods that better align with human judgments. We first conduct a manual annotation study to assess the quality of different revisions. Then, we investigate reference-free evaluation metrics from related NLP domains. Additionally, we examine LLM-as-a-judge approaches, analysing their ability to assess revisions with and without a gold reference. Our results show that LLMs effectively assess instruction-following but struggle with correctness, while domain-specific metrics provide complementary insights. We find that a hybrid approach combining LLM-as-a-judge evaluation and task-specific metrics offers the most reliable assessment of revision quality.
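The limitation of similarity-based metrics noted above can be illustrated with a minimal sketch. It assumes a simplified unigram-overlap F1 as a stand-in for ROUGE-1 (it is not the official ROUGE implementation, and not the evaluation code used by the authors). The function name `unigram_f1` and all example sentences are hypothetical, invented here for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a simplified, hypothetical stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, not taken from ParaRev.
original = "we propose a new method that is very good for revise scientific text"
gold_revision = "we propose a new method for revising scientific text"

# Candidate 1: the model returns the original paragraph unchanged.
unrevised_copy = original
# Candidate 2: a fluent revision that happens to use different wording.
good_paraphrase = "this paper introduces a novel approach to scientific text revision"

score_copy = unigram_f1(unrevised_copy, gold_revision)        # about 0.73
score_paraphrase = unigram_f1(good_paraphrase, gold_revision)  # about 0.32

print(f"unrevised copy:  {score_copy:.2f}")
print(f"good paraphrase: {score_paraphrase:.2f}")
```

In this toy case the unrevised copy outscores the genuine revision simply because it shares more tokens with the gold reference, which is the kind of behaviour that motivates looking beyond similarity metrics.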
