RelevAI-Reviewer: A Benchmark on AI Reviewers for Survey Paper Relevance

Paulo Henrique Couto, Quang Phuoc Ho, Nageeta Kumari, Benedictus Kent Rachmat, Thanh Gia Hieu Khuong, Ihsan Ullah, Lisheng Sun-Hosoya

TL;DR

RelevAI-Reviewer addresses the problem of assessing how relevant a survey paper is to a given call-for-papers prompt by casting it as a four-class classification task. It introduces a large, reverse-prompt-engineered dataset (25,164 instances; 100,656 prompt–paper entries) and compares baseline approaches, including SentenceTransformers embeddings with traditional classifiers and a BERT-based end-to-end model. The study finds that BERT with thermometer encoding achieves the best ranking performance, with Kendall’s Tau of roughly 0.93–0.99 depending on the encoding, and demonstrates favorable data efficiency. By launching an open Codabench benchmark, the authors enable community-driven evaluation and iteration, with the potential to augment traditional peer review by providing fast, fair relevance judgements for survey papers.
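To make the two label encodings mentioned above concrete, here is a minimal sketch for four ordered relevance classes. The exact variant used by the authors is not given in this summary, so the vector length (K-1 bits), the 0.5 threshold, and all function names below are assumptions for illustration only.

```python
from typing import List, Sequence

NUM_CLASSES = 4  # four relevance classes, as stated in the summary


def one_hot(label: int, num_classes: int = NUM_CLASSES) -> List[int]:
    """One-hot encoding: a single 1 at the label's index."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1 if i == label else 0 for i in range(num_classes)]


def thermometer(label: int, num_classes: int = NUM_CLASSES) -> List[int]:
    """Thermometer encoding (assumed K-1 bit variant).

    Label k is encoded as k ones followed by zeros, so neighbouring
    classes differ in exactly one bit and the class order is preserved.
    """
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1 if i < label else 0 for i in range(num_classes - 1)]


def decode_thermometer(scores: Sequence[float], threshold: float = 0.5) -> int:
    """Decode by counting leading scores at or above the threshold."""
    level = 0
    for score in scores:
        if score < threshold:
            break
        level += 1
    return level


# All four labels under both encodings, for side-by-side comparison.
encodings = {k: (one_hot(k), thermometer(k)) for k in range(NUM_CLASSES)}
```

Under this variant, a model that confuses adjacent classes is wrong in one output bit only, which is one plausible reason an ordinal encoding can help on a ranking metric.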

Abstract

Recent advancements in Artificial Intelligence (AI), particularly the widespread adoption of Large Language Models (LLMs), have significantly enhanced text analysis capabilities. This technological evolution offers considerable promise for automating the review of scientific papers, a task traditionally managed through peer review by fellow researchers. Despite its critical role in maintaining research quality, the conventional peer-review process is often slow and subject to biases, potentially impeding the swift propagation of scientific knowledge. In this paper, we propose RelevAI-Reviewer, an automatic system that frames the task of survey paper review as a classification problem, aimed at assessing the relevance of a paper to a specified prompt, analogous to a "call for papers". To address this, we introduce a novel dataset comprising 25,164 instances. Each instance contains one prompt and four candidate papers, each varying in relevance to the prompt. The objective is to develop a machine learning (ML) model capable of determining the relevance of each paper and identifying the most pertinent one. We explore various baseline approaches, including traditional ML classifiers such as Support Vector Machine (SVM) and advanced language models such as BERT. Preliminary findings indicate that the BERT-based end-to-end classifier outperforms the conventional ML methods. We present this problem as a public challenge to foster engagement and interest in this area of research.
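The task described in the abstract (score four candidate papers per prompt, then pick the most pertinent one) can be evaluated per instance with a rank correlation such as Kendall's Tau. The sketch below is a plain tau-a implementation; the summary does not say which Tau variant the benchmark uses, and the scores and function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(true_scores: Sequence[float], pred_scores: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(true_scores) != len(pred_scores):
        raise ValueError("score lists must have the same length")
    n = len(true_scores)
    if n < 2:
        raise ValueError("need at least two items to rank")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (true_scores[i] - true_scores[j]) * (pred_scores[i] - pred_scores[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def most_relevant(pred_scores: Sequence[float]) -> int:
    """Index of the candidate paper with the highest predicted relevance."""
    return max(range(len(pred_scores)), key=lambda i: pred_scores[i])


# One hypothetical instance: four papers with distinct true relevance levels.
true_relevance = [3, 2, 1, 0]
predicted = [2.7, 0.9, 1.4, 0.1]  # made-up model outputs; papers 1 and 2 are swapped

tau = kendall_tau(true_relevance, predicted)
best = most_relevant(predicted)
```

With four papers there are six pairs per instance, so a single swapped pair already lowers Tau noticeably; averaging over instances gives a dataset-level score.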

Paper Structure (35 sections, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Distribution of cosine similarities between prompts and papers across four relevance categories using vectorized embeddings. Histograms were generated from similarity scores with the number of bins set to 20. Dashed lines represent the mean similarity score for each category.
  • Figure 2: SVC Performance with Different Training Sizes and Corresponding F1-scores.
  • Figure 3: Kendall's Tau for BERT (one-hot and thermometer encodings) and SVC with varying training data sizes.
  • Figure 4: F1-score for BERT with one-hot encoded labels.
  • Figure 5: F1-score for BERT with thermometer-encoded labels.
  • ...and 4 more figures
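
The analysis behind Figure 1 (cosine similarity between prompt and paper embeddings, binned into 20-bin histograms with a per-category mean) can be sketched as follows. The toy vectors, the category names, and the fixed [-1, 1] bin range are assumptions; the caption states only the bin count and the dashed mean lines.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def histogram(values: Sequence[float], bins: int = 20,
              lo: float = -1.0, hi: float = 1.0) -> List[int]:
    """Count values into equal-width bins over [lo, hi] (range is assumed)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for value in values:
        index = int((value - lo) / width)
        index = max(0, min(index, bins - 1))  # clamp the right edge into the last bin
        counts[index] += 1
    return counts


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
prompt = [1.0, 0.0, 0.0]
papers_by_category: Dict[str, List[List[float]]] = {
    "most_relevant": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "least_relevant": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.3]],
}

similarities = {
    category: [cosine_similarity(prompt, paper) for paper in papers]
    for category, papers in papers_by_category.items()
}
# The per-category mean corresponds to the dashed line in each histogram.
means = {category: sum(vals) / len(vals) for category, vals in similarities.items()}
histograms = {category: histogram(vals) for category, vals in similarities.items()}
```

If the categories are well separated, the per-category means should decrease from the most relevant to the least relevant group, which is what such a plot is meant to check.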