DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process

Minjun Zhu, Yixuan Weng, Linyi Yang, Yue Zhang

TL;DR

DeepReview tackles the reliability and scalability challenges of using LLMs for automated manuscript review by introducing a structured, multi-stage reasoning framework. It builds a richly annotated DeepReview-13K dataset and a corresponding DeepReviewer-14B model, capable of Fast, Standard, and Best inference modes to balance efficiency and quality. The approach yields superior quantitative performance (lower MSE/MAE, higher ranking and selection scores) and stronger qualitative review quality, while demonstrating robustness to adversarial prompts and effective test-time scaling. Collectively, the work provides a practical, open-resource platform that advances LLM-assisted peer review while emphasizing responsible use and human-in-the-loop safeguards.
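The MSE and MAE figures mentioned above measure how far model-predicted ratings fall from human reviewer ratings. The following is a minimal sketch of the two metrics, not the authors' evaluation code; the function names and the example scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean squared error between predicted and reference scores."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def mae(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute error between predicted and reference scores."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical ratings for four papers (not taken from the paper's data).
model_scores = [6.0, 5.0, 8.0, 3.0]
human_scores = [5.0, 5.0, 6.0, 4.0]

print(mse(model_scores, human_scores))  # 1.5
print(mae(model_scores, human_scores))  # 1.0
```

Lower values on both metrics mean the predicted ratings track the human ratings more closely. MSE penalizes large individual errors more heavily than MAE.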

Abstract

Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset and demo have been released at http://ai-researcher.net.

Paper Structure

This paper contains 29 sections, 2 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of the DeepReviewer. (a) Input paper example with a real-world research paper. (b) Output example showing DeepReviewer's multi-stage reasoning process: Novelty Verification, Multi-dimension Review, and Reliability Verification. (c) Inference modes: fast, standard, and best, highlighting different reasoning paths. A more detailed case study is provided in the appendix.
  • Figure 2: Scoring comparison of the AI Scientist and DeepReviewer-14B models under normal and attack scenarios. The DeepReviewer model shows the smallest increase in scores under attack (the growth of the red bars relative to the blue bars), indicating stronger robustness.
  • Figure 3: Performance of the DeepReviewer model in the Test-Time Scaling experiment. The x-axis represents the number of tokens generated during model inference, and the y-axis represents different evaluation metrics. The green and red dashed lines are linear regression fits for the Reasoning Path Scaling and Reviewer Scaling methods, respectively.
  • Figure 4: System prompt used to guide Gemini-2.0-Thinking-Flash as a judge to evaluate generated review comments.
  • Figure 5: System prompt designed to instruct the LLM on how to enhance and improve the usefulness of original review comments by incorporating author responses and maintaining original review context.
  • ...and 7 more figures