MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu

TL;DR

MMReview introduces a multidisciplinary, multimodal benchmark for evaluating LLM-based peer review across 17 domains and four disciplines, incorporating text, figures/tables, and PDF-page images. It constructs a 240-sample gold standard with 13 tasks organized into stepwise, outcome-based, preference-based, and attack-based categories, enabling comprehensive assessment of reasoning, alignment, and robustness. Across 16 open-source and 5 closed-source models, the study finds that model scale and structured reasoning improve performance on key tasks, while multimodal inputs enhance robustness to prompt manipulation and reveal domain-specific strengths. MMReview aims to standardize automated peer-review development and provide actionable insights for deploying LLM-assisted review tools in practice.

Abstract

With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose MMReview, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
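To make the benchmark's organization concrete, the following is a minimal Python sketch of how one sample could be represented and tallied by discipline. It is an illustration only, not the authors' data format: the four discipline names and the four task-category labels come from the abstract and TL;DR, while `PaperSample`, its field names, `count_by_discipline`, and the placeholder domain strings are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class Discipline(Enum):
    # The four disciplines named in the abstract.
    ARTIFICIAL_INTELLIGENCE = "Artificial Intelligence"
    NATURAL_SCIENCES = "Natural Sciences"
    ENGINEERING_SCIENCES = "Engineering Sciences"
    SOCIAL_SCIENCES = "Social Sciences"


class TaskCategory(Enum):
    # The four task categories named in the TL;DR.
    STEPWISE = "stepwise"
    OUTCOME = "outcome-based"
    PREFERENCE = "preference-based"
    ATTACK = "attack-based"


@dataclass(frozen=True)
class PaperSample:
    # Hypothetical record layout; the real MMReview schema may differ.
    paper_id: str                          # assumed identifier
    discipline: Discipline
    domain: str                            # one of the 17 research domains
    review_comments: Tuple[str, ...] = ()  # expert-written review text


def count_by_discipline(samples: Iterable[PaperSample]) -> Counter:
    """Count how many samples fall under each discipline."""
    return Counter(sample.discipline for sample in samples)


# Placeholder samples; the domain strings are invented for illustration.
samples = [
    PaperSample("p1", Discipline.ARTIFICIAL_INTELLIGENCE, "domain-a"),
    PaperSample("p2", Discipline.SOCIAL_SCIENCES, "domain-b"),
    PaperSample("p3", Discipline.ARTIFICIAL_INTELLIGENCE, "domain-a"),
]
counts = count_by_discipline(samples)
```

A tally like this is one simple way to check how evenly a sample set covers the four disciplines before comparing model scores across them.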

Paper Structure

This paper contains 39 sections, 26 figures, 10 tables.

Figures (26)

  • Figure 1: The construction pipeline of MMReview. The construction pipeline is divided into three stages: data collection, data processing, and task construction.
  • Figure 2: The average scores under text-only input setting, with context length measured in tokens.
  • Figure 3: The average scores under pdf-as-image input setting, with context length measured in the number of images.
  • Figure 4: Result of Summary task in case 1.
  • Figure 5: Result of SE task in case 1.
  • ...and 21 more figures