
LLMs as Meta-Reviewers' Assistants: A Case Study

Eftekhar Hossain, Sanjeev Kumar Sinha, Naman Bansal, Alex Knipper, Souvika Sarkar, John Salvador, Yash Mahajan, Sri Guttikonda, Mousumi Akter, Md. Mahadi Hassan, Matthew Freestone, Matthew C. Williams, Dongji Feng, Santu Karmaker

TL;DR

This study investigates whether Large Language Models (GPT-3.5, PaLM2, and LLaMA2) can assist meta-reviewers by generating multi-perspective summaries (MPS) of reviewer opinions using TELeR-based prompting. Using 40 ICLR papers with associated reviews and handcrafted meta-reviews from OpenReview, the authors conduct a rigorous human evaluation (micro and macro) and an automatic GPT-4o-based assessment, revealing that GPT-3.5 and PaLM2 generally outperform LLaMA2 in manuscript-level judgments, while PaLM2 often yields higher recall and GPT-3.5 higher precision. However, automatic evaluation with GPT-4o shows limited alignment with human judgments for complex, aspect-aware tasks, raising concerns about relying on AI evaluators for such content. The results suggest LLMs can be useful assistants under careful prompt design (TELeR levels 3–4) and human verification, but model choice and evaluation methodology significantly affect reliability and usefulness in meta-review workflows.

Abstract

One of the most important yet onerous tasks in the academic peer-reviewing process is composing meta-reviews, which involves assimilating diverse opinions from multiple expert peers, formulating one's own judgment as a senior expert, and then summarizing all these perspectives into a concise, holistic overview to make an overall recommendation. This process is time-consuming and can be compromised by human factors such as fatigue, inconsistency, and overlooked details. Given the latest major developments in Large Language Models (LLMs), it is compelling to study rigorously whether LLMs can help meta-reviewers perform this important task better. In this paper, we perform a case study with three popular LLMs, i.e., GPT-3.5, LLaMA2, and PaLM2, to assist meta-reviewers in better comprehending multiple experts' perspectives by generating a controlled multi-perspective summary (MPS) of their opinions. To achieve this, we prompt the three LLMs with different types/levels of prompts based on the recently proposed TELeR taxonomy. Finally, we perform a detailed qualitative study of the MPSs generated by the LLMs and report our findings.

Paper Structure (25 sections, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Core Contributions Ratings - rated separately across different Prompt Levels and different LLMs. Here, SA: Strongly Agree, A: Agree, N: Neutral, D: Disagree, SD: Strongly Disagree, P: Precision, and R: Recall.
  • Figure 5: Pearson correlation between human and GPT-4 evaluations across different aspects of Micro Evaluation (Core Contribution-CC, Common Strengths-CS, Common Weaknesses-CW, Literature Review Quality or Missing References-MR, Suggestions for Improvement-SI). Here, Px indicates the prompt levels.
  • Figure 6: Count distribution of Human (H) vs. GPT-4 (G) evaluation scores on GPT-3.5-generated MPS for higher prompt levels (levels 3 and 4). Here, 'HG=' indicates that the human and GPT-4 give the same score, 'G>H' indicates that GPT-4 gives a higher score than the human, and 'G=x+H' indicates that GPT-4 gives a score x points higher than the human.
  • Figure 7: TELeR taxonomy for prompting LLMs to perform complex tasks. For details, see Santu and Feng (2023).
  • Figure 8: Core Contributions Ratings - Prompt TELeR Level 1-4.
  • ...and 4 more figures
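
The Figure 5 caption refers to a Pearson correlation between human and GPT-4 scores, computed per evaluation aspect. The following is only a minimal sketch of how such a per-aspect correlation could be computed; it is not the authors' code. The function names, the numeric 1 to 5 encoding of the ratings (SD=1 through SA=5), and the sample scores are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when one rater gives a constant score.
        return float("nan")
    return cov / sqrt(var_x * var_y)


def correlation_by_aspect(human, gpt):
    """Map each aspect code (e.g. 'CC', 'CS') to its human-vs-GPT correlation.

    `human` and `gpt` are dicts from aspect code to a list of scores, one
    score per generated summary, in the same order in both dicts.
    """
    return {aspect: pearson(human[aspect], gpt[aspect]) for aspect in human}


if __name__ == "__main__":
    # Made-up scores for illustration only (assumed encoding: SD=1 ... SA=5).
    human_scores = {"CC": [5, 4, 4, 2, 3], "CW": [3, 3, 2, 4, 5]}
    gpt_scores = {"CC": [5, 5, 4, 3, 4], "CW": [4, 4, 4, 4, 5]}
    for aspect, r in correlation_by_aspect(human_scores, gpt_scores).items():
        print(f"{aspect}: r = {r:.2f}")
```

A value near 1 would indicate that the automatic evaluator ranks summaries much as the human annotators do for that aspect, while a value near 0 would indicate little linear agreement.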