DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process
Minjun Zhu, Yixuan Weng, Linyi Yang, Yue Zhang
TL;DR
DeepReview tackles the reliability and scalability challenges of using LLMs for automated manuscript review by introducing a structured, multi-stage reasoning framework. It builds a richly annotated DeepReview-13K dataset and a corresponding DeepReviewer-14B model, which supports Fast, Standard, and Best inference modes to balance efficiency and quality. The approach yields superior quantitative performance (lower MSE/MAE, higher ranking and selection scores) and stronger qualitative review quality, while demonstrating robustness to adversarial prompts and effective test-time scaling. Collectively, the work provides a practical, open-resource platform that advances LLM-assisted peer review while emphasizing responsible use and human-in-the-loop safeguards.
Abstract
Large Language Models (LLMs) are increasingly utilized in scientific research assessment, particularly in automated paper review. However, existing LLM-based review systems face significant challenges, including limited domain expertise, hallucinated reasoning, and a lack of structured evaluation. To address these limitations, we introduce DeepReview, a multi-stage framework designed to emulate expert reviewers by incorporating structured analysis, literature retrieval, and evidence-based argumentation. Using DeepReview-13K, a curated dataset with structured annotations, we train DeepReviewer-14B, which outperforms CycleReviewer-70B with fewer tokens. In its best mode, DeepReviewer-14B achieves win rates of 88.21% and 80.20% against GPT-o1 and DeepSeek-R1 in evaluations. Our work sets a new benchmark for LLM-based paper review, with all resources publicly available. The code, model, dataset, and demo have been released at http://ai-researcher.net.
