Can We Automate Scientific Reviewing?

Weizhe Yuan, Pengfei Liu, Graham Neubig

TL;DR

The paper asks whether automated first-pass reviews can speed up scientific peer review as the volume of publications grows. It introduces the ASAP-Review dataset and the ReviewAdvisor framework, which formulate review generation as aspect-based summarization and use a two-stage extract-then-generate approach to handle long documents. Through automatic and human evaluations, the authors show that automated reviews can be more comprehensive and evidence-supported, but that they suffer from factual inaccuracies, limited high-level reasoning, bias, and a lack of critical questioning. They conclude that automated review systems are best used as machine-assisted tools that support human reviewers, and they outline eight concrete challenges and directions for future work on reliability, bias mitigation, and evaluation.

Abstract

The rapid development of science and technology has been accompanied by an exponential growth in peer-reviewed scientific publications. At the same time, the review of each paper is a laborious process that must be carried out by subject matter experts. Thus, providing high-quality reviews of this growing number of papers is a significant challenge. In this work, we ask the question "can we automate scientific reviewing?", discussing the possibility of using state-of-the-art natural language processing (NLP) models to generate first-pass peer reviews for scientific papers. Arguably the most difficult part of this is defining what a "good" review is in the first place, so we first discuss possible evaluation measures for such reviews. We then collect a dataset of papers in the machine learning domain, annotate them with different aspects of content covered in each review, and train targeted summarization models that take in papers to generate reviews. Comprehensive experimental results show that system-generated reviews tend to touch upon more aspects of the paper than human-written reviews, but the generated text can suffer from lower constructiveness for all aspects except the explanation of the core ideas of the papers, which are largely factually correct. We finally summarize eight challenges in the pursuit of a good review generation system together with potential solutions, which, hopefully, will inspire more future research on this subject. We make all code, and the dataset publicly available: https://github.com/neulab/ReviewAdvisor, as well as a ReviewAdvisor system: http://review.nlpedia.ai/.

Paper Structure

This paper contains 93 sections, 13 equations, 11 figures, 16 tables.

Figures (11)

  • Figure 1: Data annotation pipeline.
  • Figure 2: (a) and (b) represent distributions over seven aspects obtained by human and BERT-based tagger respectively. Red bins represent positive sentiment while green ones suggest negative sentiment. We omit the "Sum" aspect since it has no polarity definition.
  • Figure 3: Summarization from three different views for the paper "Attention Is All You Need" (Vaswani et al., 2017). Summaries from the three views (author, reader, reviewer) come from the paper's abstract, a citance (i.e., a paper that cites this paper), and a peer review, respectively.
  • Figure 4: Selected sentence position distribution. We use the relative position of each sentence with regard to the whole article, thus taking values from 0 to 1.
  • Figure 5: Aspect-aware summarization.
  • ...and 6 more figures
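
The relative sentence position described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' code: the function names `relative_positions` and `position_histogram`, the choice to map the first sentence to 0 and the last to 1, and the bin count are all assumptions made for illustration.

```python
from typing import List


def relative_positions(num_sentences: int) -> List[float]:
    """Map sentence indices 0..n-1 onto the interval [0, 1].

    Assumption: the first sentence maps to 0 and the last to 1.
    """
    if num_sentences <= 0:
        return []
    if num_sentences == 1:
        return [0.0]
    return [i / (num_sentences - 1) for i in range(num_sentences)]


def position_histogram(positions: List[float], num_bins: int = 10) -> List[int]:
    """Count how many relative positions fall into each equal-width bin."""
    counts = [0] * num_bins
    for p in positions:
        # A position of exactly 1.0 is placed in the last bin.
        idx = min(int(p * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical example: a 5-sentence article in which every sentence is selected.
positions = relative_positions(5)
print(positions)
print(position_histogram(positions, num_bins=4))
```

Pooling the positions of the selected sentences across many papers and binning them this way gives a distribution over the 0 to 1 range of the kind the caption describes.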