
REM-CTX: Automated Peer Review via Reinforcement Learning with Auxiliary Context

Pawin Taechoyotin, Daniel E. Acuna

Abstract

Most automated peer review systems rely on textual manuscript content alone, leaving visual elements such as figures and external scholarly signals underutilized. We introduce REM-CTX, a reinforcement-learning system that incorporates auxiliary context into the review generation process via correspondence-aware reward functions. REM-CTX trains an 8B-parameter language model with Group Relative Policy Optimization (GRPO) and combines a multi-aspect quality reward with two correspondence rewards that explicitly encourage alignment with auxiliary context. Experiments on manuscripts across Computer, Biological, and Physical Sciences show that REM-CTX achieves the highest overall review quality among six baselines, outperforming systems built on substantially larger commercial models and surpassing the next-best RL baseline on both quality and contextual grounding metrics. Ablation studies confirm that the two correspondence rewards are complementary: each selectively improves its targeted correspondence reward while preserving all quality dimensions, and the full model outperforms all partial variants. Analysis of training dynamics reveals that the criticism aspect is negatively correlated with other metrics during training, suggesting that future studies should group multi-dimension rewards for review generation.


Paper Structure

This paper contains 32 sections, 3 equations, 6 figures, and 1 table.

Figures (6)

  • Figure 1: Comparison of scientific review generation models. (a) Vanilla or simple prompting-based review generation models, which rely primarily on textual manuscript content and internal model knowledge. (b) Structured, prompting-based review generation model liang2024useful. (c) A multi-agent review generation model that incorporates agents to analyze the manuscript from different aspects darcy2024marg. (d) REMOR: a reinforcement learning-based review generation model that optimizes review quality using reward functions based on the manuscript text only taechoyotin2025remor. (e) Our proposed model, REM-CTX, which combines auxiliary context with reinforcement-learning optimization via GRPO and correspondence-aware reward functions, producing more grounded and informative peer reviews.
  • Figure 2: Figure Correspondence Reward Function (FCRF) and the Novelty Correspondence Reward Function (NCRF) datasets and model construction. (a) Sentences from human reviews are paired with auxiliary context (figure details or novelty assessments), and each pair is labeled by an LLM along two axes: relevance and consistency. (b) A ModernBERT-based classifier is trained on these labels to score new sentence--context pairs.
  • Figure 3: Performance of Vanilla (Sonnet 4.5), liang2024useful, MARG darcy2024marg, MAMORX taechoyotin2024mamorx, Qwen3-8B, REMOR taechoyotin2025remor, and REM-CTX across overall review quality (Total Aspect Coverage), dimension coverage, and correspondence reward functions. REM-CTX (TRC) is the score when the thinking traces are included in the evaluation. (a) Dimension and correspondence scores across models. (b) Overall review quality scores, based on a composite of multiple aspect-specific metrics.
  • Figure 4: Correlation scores of dimension and correspondence scores calculated from a standardized reward value across training epochs (see Appendix \ref{app:std-learning-curve}). This analysis reveals that criticism is negatively correlated with both the novelty correspondence score and praise, while presentation & reporting is negatively correlated with both materials & methods and the figure correspondence score.
  • Figure 5: REM-CTX scores across Computer Science ($n{=}130$), Biological Science ($n{=}80$), and Physical Science ($n{=}24$). (a) Per-dimension scores. (b) Overall quality. Computer Science articles receive significantly higher quality scores than Biological Sciences ($p < 0.01$). (c) The scores for each minor discipline are within margins of error, although every minor discipline other than Computer Science (130 papers) has only 4 papers. Overall, this plot suggests that REM-CTX treats the minor disciplines equally.
  • ...and 1 more figure
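The FCRF/NCRF pipeline described in Figure 2 — pairing review sentences with auxiliary context, scoring each pair on relevance and consistency, and turning those scores into a reward — can be sketched as below. This is a hypothetical illustration only: the `PairScore` interface, the product-then-average aggregation, and the keyword-overlap stand-in scorer are assumptions, not the paper's actual ModernBERT-based classifier or reward definition.

```python
# Hypothetical sketch of a correspondence reward: each review sentence is
# paired with auxiliary context (e.g. a figure caption or novelty
# assessment), a scorer rates the pair on relevance and consistency, and
# the reward averages the per-pair products. The real system trains a
# ModernBERT-based classifier for this; the interface here is invented.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PairScore:
    relevance: float    # in [0, 1]: does the sentence refer to the context?
    consistency: float  # in [0, 1]: does it agree with the context?


def correspondence_reward(
    sentences: List[str],
    context: str,
    score_pair: Callable[[str, str], PairScore],
) -> float:
    """Average relevance * consistency over all sentence-context pairs."""
    if not sentences:
        return 0.0
    scores = [score_pair(s, context) for s in sentences]
    return sum(p.relevance * p.consistency for p in scores) / len(scores)


def toy_scorer(sentence: str, context: str) -> PairScore:
    """Toy stand-in: word overlap as relevance, fixed consistency."""
    sent_words = set(sentence.lower().split())
    ctx_words = set(context.lower().split())
    overlap = len(sent_words & ctx_words) / max(len(sent_words), 1)
    return PairScore(relevance=overlap, consistency=1.0)
```

In a GRPO setup, a reward like this would be combined with the multi-aspect quality reward when ranking sampled reviews within a group; the combination weights are not specified in this summary.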