ReviewerToo: Should AI Join The Program Committee? A Look At The Future of Peer Review

Gaurav Sahu, Hugo Larochelle, Laurent Charlin, Christopher Pal

TL;DR

ReviewerToo presents a modular, socio-technical framework for AI-assisted peer review that models diverse reviewer personas and a meta-reviewing layer to enhance consistency and fairness. On the ICLR-2k dataset, AI agents achieve near-human decisions for binary accept/reject tasks and produce structured, actionable feedback, with meta-review ensembles delivering the strongest performance. The work highlights where AI reviewers excel (fact-checking, literature coverage) and where they lag (methodological novelty, theory), and argues for hybrid pipelines that leave high-stakes judgments to humans. Practical guidelines emphasize ensembles, contextual conditioning, and quality-focused evaluation to scale peer review while maintaining quality and fairness as scientific publishing grows.

Abstract

Peer review is the cornerstone of scientific publishing, yet it suffers from inconsistencies, reviewer subjectivity, and scalability challenges. We introduce ReviewerToo, a modular framework for studying and deploying AI-assisted peer review to complement human judgment with systematic and consistent assessments. ReviewerToo supports systematic experiments with specialized reviewer personas and structured evaluation criteria, and can be partially or fully integrated into real conference workflows. We validate ReviewerToo on a carefully curated dataset of 1,963 paper submissions from ICLR 2025, where our experiments with the gpt-oss-120b model achieve 81.8% accuracy on the task of classifying a paper as accept or reject, compared to 83.9% for the average human reviewer. Additionally, ReviewerToo-generated reviews are rated as higher quality than the human average by an LLM judge, though they still trail the strongest expert contributions. Our analysis highlights domains where AI reviewers excel (e.g., fact-checking, literature coverage) and where they struggle (e.g., assessing methodological novelty and theoretical contributions), underscoring the continued need for human expertise. Based on these findings, we propose guidelines for integrating AI into peer-review pipelines, showing how AI can enhance consistency, coverage, and fairness while leaving complex evaluative judgments to domain experts. Our work provides a foundation for systematic, hybrid peer-review systems that scale with the growth of scientific publishing.

Paper Structure

This paper contains 82 sections, 1 equation, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Performance of Different Reviewers on the ICLR-2k dataset.
  • Figure 2:
  • Figure 3: Confusion Matrices for binary Classification Task
  • Figure 4: Pairwise Cohen's $\kappa$ for different types of reviewers
  • Figure 5: Confusion Matrices for binary Classification Task Post-Discussion
  • ...and 2 more figures
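
The figures above report accuracy on the binary accept/reject task and pairwise Cohen's $\kappa$ between reviewers. As a reading aid, the following is a minimal sketch of how these two standard metrics are computed for binary decisions. It is not the paper's evaluation code: the function names and the toy labels (`gold`, `ai`) are hypothetical and do not come from the ICLR-2k data.

```python
from collections import Counter


def accuracy(pred, gold):
    """Fraction of decisions in `pred` that match `gold`."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items on which both raters agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both raters always use the same single label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels (1 = accept, 0 = reject); illustrative only.
gold = [1, 1, 0, 0, 1, 0, 0, 1]
ai = [1, 0, 0, 0, 1, 0, 1, 1]

print(accuracy(ai, gold))     # 0.75
print(cohen_kappa(ai, gold))  # 0.5
```

On these toy labels the two raters agree on 6 of 8 papers, and because each accepts exactly half of them, chance agreement is 0.5, which gives $\kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5$. The gap between the two numbers shows why agreement between reviewers is better summarized by $\kappa$ than by raw agreement when accept rates are skewed.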