GPT4 is Slightly Helpful for Peer-Review Assistance: A Pilot Study

Zachary Robertson

TL;DR

This pilot study investigates GPT-4 as a peer-review assistant by generating reviews and comparing them to human reviews in a NeurIPS-style setting with 10 author-participants. It uses a structured GPT generation pipeline, NeurIPS-style formatting, and a 1–5 helpfulness scale, alongside adversarial robustness tests. Results show comparable average helpfulness for GPT and human reviews but higher variance for the GPT-generated reviews, with GPT tending to summarize content and humans focusing on detailed critique. Adversarial tests reveal context-size effects and areas where GPT struggles, underscoring the need for larger-scale studies and careful integration into the review process. Overall, the work suggests AI-assisted peer review could help address resource constraints while highlighting important ethical and methodological considerations for deployment.

Abstract

In this pilot study, we investigate the use of GPT-4 to assist in the peer-review process. Our key hypothesis was that GPT-generated reviews could be as helpful as reviews written by human reviewers. By comparing human-written and GPT-generated reviews of academic papers submitted to a major machine learning conference, we provide initial evidence that artificial intelligence can contribute effectively to the peer-review process. We also perform robustness experiments with inserted errors to understand which parts of the paper the model tends to focus on. Our findings open new avenues for leveraging machine learning tools to address resource constraints in peer review. The results also shed light on potential enhancements to the review process and lay the groundwork for further research on scaling oversight in a domain where human feedback is increasingly a scarce resource.

Paper Structure (19 sections, 1 figure, 2 tables)

This paper contains 19 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Mean Helpfulness Ratings of GPT and Human Reviews. The bar chart shows the mean helpfulness ratings for GPT-generated and human reviews, both of which are approximately 3 on a scale of 1 to 5. The error bars represent the 95% confidence interval, indicating the variability in ratings for each type of review. Notably, GPT reviews showed greater variability in helpfulness (3 ± 0.96) than human reviews (3.1 ± 0.57).
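
To make the "mean ± interval" notation in the Figure 1 caption concrete, here is a minimal Python sketch of how a mean rating and a 95% confidence half-width could be computed from a list of 1–5 helpfulness scores. It is an illustration under stated assumptions, not the study's analysis: the function name `mean_with_ci` is hypothetical, the ratings are made-up placeholders rather than the study's data, and the interval uses a normal approximation (the paper does not say how its intervals were computed; with only about 10 ratings, a t-based interval would be somewhat wider).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_with_ci(ratings, confidence=0.95):
    """Return (mean, half_width) for a normal-approximation confidence interval.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate variability")
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * stdev(ratings) / sqrt(n)
    return mean(ratings), half_width


# Placeholder ratings on the 1-5 scale (assumed values, NOT the study's data).
example_ratings = [1, 2, 3, 3, 4, 5, 2, 4, 3, 3]
m, hw = mean_with_ci(example_ratings)
print(f"{m:.2f} ± {hw:.2f}")
```

A wider half-width for one group of reviews than the other, at the same sample size, corresponds to more spread in the individual ratings, which is the pattern the caption describes for GPT reviews.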