ReviewEval: An Evaluation Framework for AI-Generated Reviews

Madhav Krishan Garg, Tejash Prasad, Tanmay Singhal, Chhavi Kirtani, Murari Mandal, Dhruv Kumar

TL;DR

The paper addresses the challenge of scalable, high-quality peer review amid increasing submissions by introducing ReviewEval, a multi-dimensional evaluation framework, and ReviewAgent, an LLM-based reviewer with alignment and iterative refinement loops. ReviewEval defines five evaluation dimensions (alignment with human reviews, factual accuracy, analytical depth, actionable insights, and adherence to reviewer guidelines), and ReviewAgent combines a self-refinement loop with an external improvement loop to enhance its final reviews. In experiments on 16 NeurIPS 2024 papers, the authors show that ReviewAgent matches or exceeds expert reviews in actionable insights, factual correctness, and guideline adherence, while also improving topic coverage and depth in many configurations. The approach offers a transparent, metric-driven path toward more reliable AI-generated reviews with venue-specific tailoring, potentially reducing reviewer burden and accelerating scholarly discourse.

Abstract

The escalating volume of academic research, coupled with a shortage of qualified reviewers, necessitates innovative approaches to peer review. In this work, we propose: 1. ReviewEval, a comprehensive evaluation framework for AI-generated reviews that measures alignment with human assessments, verifies factual accuracy, assesses analytical depth, and gauges the degree of constructiveness and adherence to reviewer guidelines; and 2. ReviewAgent, an LLM-based review generation agent featuring a novel alignment mechanism to tailor feedback to target conferences and journals, along with a self-refinement loop that iteratively optimizes its intermediate outputs and an external improvement loop that uses ReviewEval to improve the final reviews. ReviewAgent improves actionable insights by 6.78% and 47.62% over existing AI baselines and expert reviews, respectively. Further, it boosts analytical depth by 3.97% and 12.73%, and enhances adherence to guidelines by 10.11% and 47.26%, respectively. This paper establishes essential metrics for AI-based peer review and substantially enhances the reliability and impact of AI-generated reviews in academic research.
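The external improvement loop described in the abstract can be pictured as a score-revise-repeat cycle. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `improve_review`, `evaluate`, `revise`, the 0 to 1 score scale, the `threshold`, and the `max_rounds` cap are all hypothetical, and the toy evaluator and reviser stand in for the LLM-backed components the paper would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The five dimensions named in the abstract. The identifiers and the
# 0-1 score scale are assumptions made for this sketch.
DIMENSIONS = (
    "alignment_with_human_reviews",
    "factual_accuracy",
    "analytical_depth",
    "actionable_insights",
    "guideline_adherence",
)

Scores = Dict[str, float]


@dataclass
class RefinementResult:
    review: str
    scores: Scores
    rounds: int


def improve_review(
    draft: str,
    evaluate: Callable[[str], Scores],
    revise: Callable[[str, Scores], str],
    threshold: float = 0.8,  # hypothetical stopping criterion
    max_rounds: int = 3,     # hypothetical cap on revision rounds
) -> RefinementResult:
    """Hypothetical improvement loop: score the review, revise, repeat."""
    review = draft
    scores = evaluate(review)
    rounds = 0
    # Keep revising while the weakest dimension is below the threshold.
    while rounds < max_rounds and min(scores[d] for d in DIMENSIONS) < threshold:
        review = revise(review, scores)
        scores = evaluate(review)
        rounds += 1
    return RefinementResult(review, scores, rounds)


# Toy stand-ins so the sketch runs without an LLM.
def toy_evaluate(review: str) -> Scores:
    # Every dimension gets the same score, which grows with each revision.
    score = min(1.0, 0.5 + 0.2 * review.count("[revised]"))
    return {d: score for d in DIMENSIONS}


def toy_revise(review: str, scores: Scores) -> str:
    # A real reviser would use the per-dimension scores as feedback.
    return review + " [revised]"


result = improve_review("Initial draft.", toy_evaluate, toy_revise)
print(result.rounds, result.review)
```

Stopping on the weakest dimension, rather than the average, is one possible design choice: it keeps a review from passing on strong depth alone while factual accuracy lags. The paper may use a different criterion.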

Paper Structure

This paper contains 21 sections, 16 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Examples of the challenges and limitations of AI-based research paper reviews