Are the confidence scores of reviewers consistent with the review content? Evidence from top conference proceedings in AI

Wenqing Wu, Haixu Xi, Chengzhi Zhang

TL;DR

This paper investigates whether reviewer confidence scores align with the textual content of their reviews in AI conference data. It introduces a fine-grained framework to analyze consistency at word, sentence, and aspect levels by detecting hedge sentences and annotating review aspects, using data from OpenReview and NLPEER. The study finds a high level of text-score consistency across granularity levels and a negative relation between confidence scores and paper acceptance, supporting the reliability and fairness of current peer review practices. The work contributes a scalable data-and-code release and suggests directions for broader domain coverage and advanced hedging detection to enhance transparency in peer review.

Abstract

Peer review is vital in academia for evaluating research quality. Top AI conferences use reviewer confidence scores to ensure review reliability, but existing studies lack fine-grained analysis of text-score consistency, potentially missing key details. This work assesses consistency at word, sentence, and aspect levels using deep learning and NLP conference review data. We employ deep learning to detect hedge sentences and aspects, then analyze report length, hedge word/sentence frequency, aspect mentions, and sentiment to evaluate text-score alignment. Correlation, significance, and regression tests examine confidence scores' impact on paper outcomes. Results show high text-score consistency across all levels, with regression revealing higher confidence scores correlate with paper rejection, validating expert assessments and peer review fairness.
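As a rough illustration of the word-level analysis described above, the sketch below computes a hedge-word frequency per review and a Spearman rank correlation against confidence scores. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `HEDGE_WORDS` lexicon, the function names, and the toy reviews are all hypothetical, and the paper itself uses deep learning models rather than a fixed word list for hedge sentence detection.

```python
import re
from statistics import mean

# Hypothetical mini-lexicon for illustration only; the paper's actual
# hedge vocabulary and detection model are not reproduced here.
HEDGE_WORDS = {"may", "might", "perhaps", "possibly", "seems",
               "unclear", "unsure", "probably"}


def hedge_frequency(review: str) -> float:
    """Share of tokens in a review that appear in the hedge lexicon."""
    tokens = re.findall(r"[a-z']+", review.lower())
    if not tokens:
        return 0.0
    return sum(t in HEDGE_WORDS for t in tokens) / len(tokens)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Toy (review text, confidence score) pairs, invented for illustration.
    toy_reviews = [
        ("The proof is correct and the experiments are thorough.", 5),
        ("The method seems sound, but the ablation may be incomplete.", 3),
        ("I am unsure about the theory; perhaps I might be missing context.", 2),
    ]
    freqs = [hedge_frequency(text) for text, _ in toy_reviews]
    scores = [score for _, score in toy_reviews]
    print("hedge frequencies:", freqs)
    print("Spearman rho:", spearman(freqs, scores))
```

A rank correlation is used here because confidence scores are ordinal; under the consistency hypothesis one would expect a negative coefficient, meaning more hedging accompanies lower confidence.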

Paper Structure

This paper contains 18 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: An example of a review with a high confidence score that nonetheless expresses doubts. (https://openreview.net/forum?id=Mos9F9kDwkz)
  • Figure 2: An example of a review with a low confidence score that nonetheless provides detailed explanations. (https://openreview.net/forum?id=PULSD5qI2N1)
  • Figure 3: Overview of hedge sentence prediction model.