ReviewScore: Misinformed Peer Review Detection with Large Language Models

Hyun Ryu; Doohyuk Jang; Hyemin S. Lee; Joonhyun Jeong; Gyeongman Kim; Donghyeon Cho; Gyouk Chu; Minyeong Hwang; Hyeongwon Jang; Changhun Kim; Haechan Kim; Jina Kim; Joowon Kim; Yoonjeon Kim; Kwanhyung Lee; Chanjae Park; Heecheol Yun; Gregor Betz; Eunho Yang

Abstract

Peer review is a backbone of academic research, but at most AI conferences review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed, and we introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from that weakness. We build a human expert-annotated ReviewScore dataset to test whether LLMs can automate ReviewScore evaluation. We then measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs. The models reach F1 scores of 0.4--0.5 and kappa scores of 0.3--0.4, indicating moderate agreement but also suggesting that fully automating the evaluation remains challenging. A thorough disagreement analysis reveals that most errors stem from the models' incorrect reasoning. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality.
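The agreement figures quoted in the abstract (F1 and kappa between human and model judgments) can be illustrated with a minimal sketch. It assumes binary labels (1 = misinformed, 0 = not misinformed) and reads "kappa" as Cohen's kappa; both are assumptions for illustration, and the paper's actual label scale and metric variants may differ. The function names `f1_score` and `cohen_kappa` and the toy labels are hypothetical and not taken from the paper.

```python
from collections import Counter


def f1_score(human, model, positive=1):
    """F1 of the model's labels against human labels for the positive class."""
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h == positive and m == positive)
    fp = sum(1 for h, m in pairs if h != positive and m == positive)
    fn = sum(1 for h, m in pairs if h == positive and m != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(human, model):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, model)) / n
    h_counts, m_counts = Counter(human), Counter(model)
    labels = set(human) | set(model)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(h_counts[c] * m_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels invented for illustration only (not data from the paper).
human_labels = [1, 1, 0, 0, 1, 0, 0, 0]
model_labels = [1, 0, 0, 0, 1, 1, 0, 0]
print(f1_score(human_labels, model_labels))
print(cohen_kappa(human_labels, model_labels))
```

Kappa discounts the agreement two annotators would reach by chance given how often each uses every label, which is why it can sit well below raw percent agreement when one label dominates.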
