Zero-shot reasoning for simulating scholarly peer-review

Khalid M. Saqr

TL;DR

The paper introduces xPeerd, a zero-shot reasoning framework for simulating scholarly peer review under normative constraints that ensure integrity, transparency, and contestability. It formalizes the reasoning as a constrained Bayesian-argumentation process, with a Dung graph for critique attacks and supports and with deontic guards that enforce disclosure, citation verification, and human adjudication. Using a dataset of $352$ valid simulated reviews from $n = 500$ cases, it reports that Revise decisions dominate across disciplines ($>50\%$), that Reject rates vary by field and reach $45\%$ in Health Sciences, and that the evidence-anchoring compliance rate holds steady at $29\%$ across tasks and domains. The results position xPeerd as a reproducible, auditable benchmark tool for policy and governance in scholarly publishing, capable of auditing workflows and managing integrity risks in AI-assisted peer review.
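To make the Dung-graph idea concrete, the sketch below computes the grounded extension of a small attack graph, that is, the set of arguments that survive once every attacker has itself been defeated. This is an illustration under stated assumptions, not xPeerd's implementation: the summary does not say which argumentation semantics xPeerd uses, support edges and the Bayesian weighting are omitted, and the function name and the argument labels are hypothetical.

```python
def grounded_extension(arguments, attacks):
    """Return the grounded extension of a Dung framework.

    arguments: iterable of hashable argument labels.
    attacks:   iterable of (attacker, target) pairs.
    Assumption: attack-only framework with grounded semantics; the paper's
    own semantics, support edges, and probabilities are not modeled here.
    """
    arguments = set(arguments)
    attacks = set(attacks)
    attackers = {a: set() for a in arguments}
    for src, dst in attacks:
        attackers[dst].add(src)

    accepted = set()
    while True:
        # Arguments attacked by something already accepted are defeated.
        defeated = {dst for src, dst in attacks if src in accepted}
        # An argument is acceptable when all of its attackers are defeated.
        newly = {a for a in arguments if attackers[a] <= defeated}
        if newly == accepted:
            return accepted
        accepted = newly


# Hypothetical review critique: a reviewer objection attacks the paper's
# claim, and a rebuttal pointing to the appendix attacks the objection.
args = {"claim_supported", "missing_control", "control_in_appendix"}
atts = {
    ("missing_control", "claim_supported"),
    ("control_in_appendix", "missing_control"),
}
surviving = grounded_extension(args, atts)
```

In this toy case the unattacked rebuttal defeats the objection, so the claim is reinstated. Two arguments that only attack each other would both stay out of the grounded extension, which is one way a framework can flag a dispute as unresolved and leave it to human adjudication.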

Abstract

The scholarly publishing ecosystem faces a dual crisis of unmanageable submission volumes and unregulated AI, creating an urgent need for new governance models to safeguard scientific integrity. The traditional human-only peer review regime lacks a scalable, objective benchmark, making editorial processes opaque and difficult to audit. Here we investigate a deterministic simulation framework that provides the first stable, evidence-based standard for evaluating AI-generated peer review reports. Analyzing 352 peer-review simulation reports, we identify consistent system state indicators that demonstrate its reliability. First, the system is able to simulate calibrated editorial judgment, with 'Revise' decisions consistently forming the majority outcome (>50%) across all disciplines, while 'Reject' rates dynamically adapt to field-specific norms, rising to 45% in Health Sciences. Second, it maintains unwavering procedural integrity, enforcing a stable 29% evidence-anchoring compliance rate that remains invariant across diverse review tasks and scientific domains. These findings demonstrate a system that is predictably rule-bound, mitigating the stochasticity of generative AI. For the scientific community, this provides a transparent tool to ensure fairness; for publishing strategists, it offers a scalable instrument for auditing workflows, managing integrity risks, and implementing evidence-based governance. The framework repositions AI as an essential component of institutional accountability, providing the critical infrastructure to maintain trust in scholarly communication.

Paper Structure

This paper contains 2 sections, 22 equations, 6 figures.

Table of Contents

  1. PRR.
  2. DBReviewSim.

Figures (6)

  • Figure 1: UML diagram of xPeerd axioms and computation flow. Structural classes (white), functions (yellow), and operations (blue) are distinguished, and arrows indicate the decision workflow.
  • Figure 2: ASJC Supergroup Classification and Confidence. (a) Classification counts of $n=352$ valid reports across the ASJC supergroups. (b) Distribution of classification confidence scores $\hat{p}_i$ with a critical threshold $\tau=0.20$ indicated by the dashed line.
  • Figure 3: Distribution of editorial decisions by ASJC supergroup. Proportions are shown for Reject (red), Revise (orange), and Accept (green) outcomes. For each supergroup $g$, probabilities $p(D \mid g)$ are estimated from the sample of $n=352$ valid reports.
  • Figure 4: Report length versus page anchor fraction. Each dot represents one review report ($n=352$). A regression line (red dashed) with shaded 95% confidence interval is overlaid. The inset reports Spearman’s correlation $\rho = 0.13$ with $p = 0.014$.
  • Figure 5: Total issues (major + minor) detected by review type. Violin plots show the distribution of issue counts for each review type with embedded quartiles and individual report values ($n=352$).
  • ...and 1 more figure
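The Figure 3 caption says the probabilities $p(D \mid g)$ are estimated from the sample but does not state the estimator. A minimal sketch, assuming the plain within-group relative frequency, is shown below. The function name `decision_proportions` and the toy records are hypothetical and are not the paper's data.

```python
from collections import Counter, defaultdict


def decision_proportions(records):
    """Estimate p(D | g) as the share of each decision within each group.

    records: iterable of (group, decision) pairs, one per review report.
    Assumption: simple relative frequencies with no smoothing or weighting.
    """
    counts = defaultdict(Counter)
    for group, decision in records:
        counts[group][decision] += 1
    return {
        group: {d: n / sum(c.values()) for d, n in c.items()}
        for group, c in counts.items()
    }


# Made-up toy records for illustration only (NOT the paper's 352 reports).
toy_records = [
    ("Health Sciences", "Reject"),
    ("Health Sciences", "Reject"),
    ("Health Sciences", "Revise"),
    ("Health Sciences", "Revise"),
    ("Physical Sciences", "Revise"),
    ("Physical Sciences", "Revise"),
    ("Physical Sciences", "Accept"),
    ("Physical Sciences", "Reject"),
]
props = decision_proportions(toy_records)
```

Each inner dictionary sums to 1, so the per-supergroup bars in a stacked proportion chart can be read directly from `props[group]`.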