ReviewScore: Misinformed Peer Review Detection with Large Language Models

Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim, Donghyeon Cho, Gyouk Chu, Minyeong Hwang, Hyeongwon Jang, Changhun Kim, Haechan Kim, Jina Kim, Joowon Kim, Yoonjeon Kim, Kwanhyung Lee, Chanjae Park, Heecheol Yun, Gregor Betz, Eunho Yang

Abstract

Peer review serves as a backbone of academic research, but in most AI conferences, review quality is degrading as the number of submissions explodes. To reliably detect low-quality reviews, we define misinformed review points as either "weaknesses" in a review that contain incorrect premises, or "questions" in a review that can already be answered by the paper. We verify that 15.2% of weaknesses and 26.4% of questions are misinformed and introduce ReviewScore, which indicates whether a review point is misinformed. To evaluate the factuality of each premise of a weakness, we propose an automated engine that reconstructs every explicit and implicit premise from the weakness. We build a human expert-annotated ReviewScore dataset to check the ability of LLMs to automate ReviewScore evaluation. Then, we measure human-model agreement on ReviewScore using eight current state-of-the-art LLMs. The models show F1 scores of 0.4–0.5 and kappa scores of 0.3–0.4, indicating moderate agreement but also suggesting that fully automating the evaluation remains challenging. A thorough disagreement analysis reveals that most errors are due to the models' incorrect reasoning. We also show that evaluating premise-level factuality yields significantly higher agreement than evaluating weakness-level factuality.
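
The agreement figures quoted in the abstract can be made concrete with a small sketch. The abstract does not say which kappa variant or which F1 averaging the authors use, so this example assumes binary labels (1 = misinformed), F1 on the positive class, and Cohen's kappa. The label lists are hypothetical and are not taken from the ReviewScore dataset.

```python
from collections import Counter


def f1_score(human, model, positive=1):
    """F1 of the model's labels against human labels for the positive class."""
    pairs = list(zip(human, model))
    tp = sum(h == positive and m == positive for h, m in pairs)
    fp = sum(h != positive and m == positive for h, m in pairs)
    fn = sum(h == positive and m != positive for h, m in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(human, model):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, model)) / n
    h_counts, m_counts = Counter(human), Counter(model)
    labels = set(human) | set(model)
    expected = sum(h_counts[c] * m_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations (assumption): 1 = misinformed, 0 = not misinformed.
human_labels = [1, 0, 0, 1, 0, 0, 1, 0]
model_labels = [1, 0, 1, 0, 0, 0, 1, 1]

print(f"F1    = {f1_score(human_labels, model_labels):.3f}")
print(f"kappa = {cohen_kappa(human_labels, model_labels):.3f}")
```

Kappa discounts the agreement two raters would reach by chance, which matters here because misinformed points are a minority class and raw accuracy would overstate agreement.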

Paper Structure

This paper contains 42 sections, 4 equations, 29 figures, 13 tables.

Figures (29)

  • Figure 1: Overview of ReviewScore. Each review point in a review is categorized as a question or a weakness. We further categorize each weakness as a claim or an argument by the presence of supporting reasons. Based on an appropriate knowledge base, if a question is answerable by the paper, a claim is factually incorrect, or an argument contains factually incorrect premises, then the review point is misinformed. For arguments, to extract all explicit and implicit premises, we also introduce an automatic argument reconstruction engine.
  • Figure 2: (a) Overview of automatic argument reconstruction. Given an argumentative review point and a paper, a model first generates a reconstructed argument (i.e., a set of premises and a conclusion). To check its validity, a model translates the NL reconstructed argument into FOL formulas, and then a SAT solver judges whether it is valid. To check its faithfulness, a model translates the FOL formulas back into the NL domain, and a model judges whether the reconstruction is faithful. If either of the two criteria is not met, the corresponding NL feedback is given to the generator model. (b) A representative example. We sample a review point of vit and its reconstruction along with the corresponding formulas and keys.
  • Figure 3: Types of human-model disagreements.
  • Figure 4: Score rubric for evaluating faithfulness of argument reconstruction by human annotators.
  • Figure 5: Example #1 of automatic argument reconstruction.
  • ...and 24 more figures
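
The validity check described in the Figure 2 caption can be illustrated with a minimal sketch. It rests on the standard fact that an argument is valid exactly when its premises together with the negated conclusion are unsatisfiable. This sketch is a simplification and not the authors' engine: it uses propositional formulas and a brute-force truth table instead of FOL formulas and a SAT solver. The function `is_valid` and the atoms `p` and `q` are hypothetical names introduced only for illustration.

```python
from itertools import product


def is_valid(premises, conclusion, atoms):
    """Return (True, None) if the argument is valid, else (False, counterexample).

    Each formula is a function from a truth assignment (dict) to a bool.
    The argument is valid iff no assignment makes every premise true
    while making the conclusion false.
    """
    for values in product([False, True], repeat=len(atoms)):
        env = dict(zip(atoms, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False, env  # premises hold but the conclusion fails
    return True, None


# Hypothetical atoms (assumption, not taken from the paper):
#   p: "the paper reports no ablation study"
#   q: "the contribution of each component is unclear"
atoms = ["p", "q"]
implies_p_q = lambda e: (not e["p"]) or e["q"]  # p -> q
p = lambda e: e["p"]
q = lambda e: e["q"]

# Modus ponens: {p -> q, p} entails q, so the reconstruction is valid.
print(is_valid([implies_p_q, p], q, atoms))

# Affirming the consequent: {p -> q, q} does not entail p.
print(is_valid([implies_p_q, q], p, atoms))
```

In a loop like the one the caption describes, a returned counterexample is the kind of information that could be turned into natural-language feedback for the generator model. How the authors phrase that feedback is not stated here.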

Theorems & Definitions (4)

  • Definition 1: Review Point
  • Definition 2: Misinformed Review Point
  • Definition 3: Base ReviewScore
  • Definition 4: Advanced ReviewScore
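
The rule behind Definition 2 is only named in this outline, but the abstract and the Figure 1 caption state it in words. The sketch below follows that wording under stated assumptions: the factuality and answerability judgments are supplied as booleans, and the class `ReviewPoint` with its field names is hypothetical. It does not reproduce the Base or Advanced ReviewScore of Definitions 3 and 4, which this page does not spell out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReviewPoint:
    """Hypothetical container for one review point and its judgments."""
    kind: str  # "question", "claim", or "argument"
    answerable_by_paper: Optional[bool] = None  # used for questions
    claim_is_factual: Optional[bool] = None  # used for claims
    premise_is_factual: List[bool] = field(default_factory=list)  # arguments


def is_misinformed(point: ReviewPoint) -> bool:
    """Apply the rule from the Figure 1 caption to a judged review point."""
    if point.kind == "question":
        # A question is misinformed if the paper already answers it.
        return point.answerable_by_paper is True
    if point.kind == "claim":
        # A claim is misinformed if it is factually incorrect.
        return point.claim_is_factual is False
    if point.kind == "argument":
        # An argument is misinformed if any premise is factually incorrect.
        return any(not ok for ok in point.premise_is_factual)
    raise ValueError(f"unknown review point kind: {point.kind!r}")


points = [
    ReviewPoint("question", answerable_by_paper=True),
    ReviewPoint("claim", claim_is_factual=True),
    ReviewPoint("argument", premise_is_factual=[True, False, True]),
]
print([is_misinformed(p) for p in points])
```

Checking premises one by one, as in the argument branch, mirrors the premise-level evaluation that the abstract reports as agreeing better with human annotators than weakness-level evaluation.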