Can We Automate Scientific Reviewing?
Weizhe Yuan, Pengfei Liu, Graham Neubig
TL;DR
The paper tackles the challenge of accelerating scientific peer review amid a flood of publications by exploring automated first-pass reviews. It introduces the ASAP-Review dataset and the ReviewAdvisor framework, which formulate review generation as aspect-based summarization and employ a two-stage extract-then-generate approach to handle long documents. Through extensive automatic and human evaluations, the authors show that automated reviews can be more comprehensive and evidence-supported, but suffer from factual inaccuracies, limited high-level reasoning, bias, and a lack of critical questioning. They conclude that automated review systems are best used as machine-assisted tools to support human reviewers, and they outline eight concrete challenges and directions for future work to improve reliability, bias mitigation, and evaluation in this domain.
Abstract
The rapid development of science and technology has been accompanied by an exponential growth in peer-reviewed scientific publications. At the same time, the review of each paper is a laborious process that must be carried out by subject matter experts. Thus, providing high-quality reviews of this growing number of papers is a significant challenge. In this work, we ask the question "can we automate scientific reviewing?", discussing the possibility of using state-of-the-art natural language processing (NLP) models to generate first-pass peer reviews for scientific papers. Arguably the most difficult part of this is defining what a "good" review is in the first place, so we first discuss possible evaluation measures for such reviews. We then collect a dataset of papers in the machine learning domain, annotate them with different aspects of content covered in each review, and train targeted summarization models that take in papers to generate reviews. Comprehensive experimental results show that system-generated reviews tend to touch upon more aspects of the paper than human-written reviews, but the generated text can suffer from lower constructiveness for all aspects except the explanation of the core ideas of the papers, which are largely factually correct. We finally summarize eight challenges in the pursuit of a good review generation system together with potential solutions, which, hopefully, will inspire more future research on this subject. We make all code and the dataset publicly available at https://github.com/neulab/ReviewAdvisor, along with a ReviewAdvisor system at http://review.nlpedia.ai/.
