ReviewEval: An Evaluation Framework for AI-Generated Reviews
Madhav Krishan Garg, Tejash Prasad, Tanmay Singhal, Chhavi Kirtani, Murari Mandal, Dhruv Kumar
TL;DR
The paper addresses the challenge of scalable, high-quality peer review amid increasing submissions by introducing ReviewEval, a multi-dimensional evaluation framework, and ReviewAgent, an LLM-based reviewer with alignment and iterative refinement loops. It defines five evaluation dimensions—alignment with human reviews, factual accuracy, analytical depth, actionable insights, and adherence to reviewer guidelines—and integrates a self-refinement and external-improvement mechanism to enhance final reviews. Through experiments on 16 NeurIPS 2024 papers, the authors demonstrate that ReviewAgent achieves competitive or superior performance to expert reviews in actionable insights, factual correctness, and guideline adherence, while also improving topic coverage and depth in many configurations. The approach offers a transparent, metric-driven path toward more reliable AI-generated reviews with venue-specific tailoring, potentially reducing reviewer burden and accelerating scholarly discourse.
Abstract
The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. In this work, we propose: 1. ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, and gauges both constructiveness and adherence to reviewer guidelines; and 2. ReviewAgent, an LLM-based review generation agent featuring a novel alignment mechanism that tailors feedback to target conferences and journals, a self-refinement loop that iteratively optimizes its intermediate outputs, and an external improvement loop that uses ReviewEval to improve the final reviews. ReviewAgent improves actionable insights by 6.78% and 47.62% over existing AI baselines and expert reviews, respectively. It also boosts analytical depth by 3.97% and 12.73%, and adherence to guidelines by 10.11% and 47.26%, respectively. This paper establishes essential metrics for AI-based peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.
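As a rough illustration of the external improvement loop described above, the following minimal sketch revises a review until an evaluator's per-dimension scores clear a cutoff. It is not the authors' implementation: the function name `external_improvement_loop`, the `evaluate` and `revise` callables, the 0-to-1 score scale, the `threshold` value, and the `max_rounds` limit are all assumptions made for the example. Only the five dimension names come from the text.

```python
from typing import Callable, Dict

# The five evaluation dimensions named in the text; the identifiers are illustrative.
DIMENSIONS = (
    "alignment_with_human_reviews",
    "factual_accuracy",
    "analytical_depth",
    "actionable_insights",
    "guideline_adherence",
)

Scores = Dict[str, float]


def external_improvement_loop(
    draft: str,
    evaluate: Callable[[str], Scores],     # hypothetical stand-in for a ReviewEval-style scorer
    revise: Callable[[str, Scores], str],  # hypothetical stand-in for an LLM revision step
    threshold: float = 0.8,                # assumed cutoff on an assumed 0-1 scale
    max_rounds: int = 3,                   # assumed budget; the source does not state one
) -> str:
    """Revise a review until every dimension reaches the threshold or the budget runs out."""
    review = draft
    for _ in range(max_rounds):
        scores = evaluate(review)
        # Keep only the dimensions that still fall short, so the reviser can target them.
        weak = {name: score for name, score in scores.items() if score < threshold}
        if not weak:
            break
        review = revise(review, weak)
    return review
```

In this sketch the evaluator is passed in as a plain callable, so the same loop could be driven by any scoring function that returns one number per dimension.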
