Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review
Qian Ruan, Iryna Gurevych
TL;DR
This work reframes author response generation (ARG) as an author-in-the-loop task and introduces three core contributions: REspGen, a modular ARG framework that incorporates explicit author input, controllable planning and length, and evaluation-guided refinement; Re$^3$Align, the first large-scale dataset of aligned review–response–revision signals; and REspEval, a comprehensive evaluation suite with 20+ metrics spanning controllability, input utilization, discourse, and response quality. Across five state-of-the-art LLMs and nine generation settings, the authors demonstrate that author input and evaluation-guided refinement improve response quality and alignment with reviewer concerns, while revealing trade-offs between richer author context and focus on core improvements, and between single- and multi-attribute controllability. The dataset, generation framework, and evaluation tools provide a foundation for future NLP research on publishable, author-aligned rebuttal writing and broader human–AI collaboration in scholarly communication. The findings underscore the value of explicitly modeling author expertise and intent to produce more concrete, persuasive, and trustworthy ARG outputs while preserving essential human involvement in peer review.
Abstract
Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. Recent work frames this task as automatic text generation, underusing author expertise and intent. In practice, authors draw on domain expertise, author-only information, and revision and response strategies (concrete forms of author expertise and intent) to address reviewer concerns, and they seek NLP assistance that integrates these signals to support effective response writing in peer review. We reformulate author response generation as an author-in-the-loop task and introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement, together with REspEval, a comprehensive evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse. To support this formulation, we construct Re$^3$Align, the first large-scale dataset of aligned review–response–revision triplets, where revisions provide signals of author expertise and intent. Experiments with state-of-the-art LLMs show the benefits of author input and evaluation-guided refinement, the impact of input design on response quality, and trade-offs between controllability and quality. We make our dataset and our generation and evaluation tools publicly available.
