Allocate Marginal Reviews to Borderline Papers Using LLM Comparative Ranking

Elliot L. Epstein, Rajat Dwaraknath, John Winnicki, Thanawat Sornwanee

TL;DR

This work proposes using LLM-based pairwise comparisons to build a comparative ranking and identify a borderline band around the acceptance cutoff before human reviewing begins. Marginal reviews are then reallocated to papers within this band, with the aim of improving decision accuracy while leaving accept/reject decisions entirely to human reviewers. The approach uses a Bradley–Terry model to derive paper scores and defines two key quantities: $\rho$ (borderline overlap) and $\Delta$ (marginal review value). A retrospective analysis of 1,000 ICLR 2025 submissions provides estimates of $\rho$ and $\Delta$, with ablations showing robustness to band settings and input scope. The expected benefit is framed formally as $(\rho s - s^2) N \Delta$. The results suggest modest but reliable gains in correct decisions under a fixed extra-review budget, offering a practical, low-risk way to focus reviewer effort where it matters most.
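To make the cost-benefit expression concrete, the sketch below simply evaluates $(\rho s - s^2) N \Delta$ as written. This summary does not define $s$, so the code treats it as a fraction between 0 and 1 (an assumption), and all numeric inputs are hypothetical values chosen to exercise the formula, not results from the paper.

```python
def expected_gain(rho: float, s: float, n_papers: int, delta: float) -> float:
    """Evaluate (rho*s - s^2) * N * Delta exactly as written in the TL;DR."""
    return (rho * s - s * s) * n_papers * delta


def best_s(rho: float) -> float:
    """Maximizer of rho*s - s^2 over s: the derivative rho - 2s vanishes at rho/2."""
    return rho / 2.0


# Hypothetical inputs, not estimates from the paper.
rho, n_papers, delta = 0.4, 1000, 0.05

s_star = best_s(rho)
gain_at_s_star = expected_gain(rho, s_star, n_papers, delta)

# Sweep a few values of s to see where the expression is positive.
sweep = {s: expected_gain(rho, s, n_papers, delta) for s in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

Two properties follow directly from the algebra: the expression is positive only for $0 < s < \rho$, and it peaks at $s = \rho/2$ with value $\rho^2 N \Delta / 4$.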

Abstract

This paper argues that large ML conferences should allocate marginal review capacity primarily to papers near the acceptance boundary, rather than spreading extra reviews via random or affinity-driven heuristics. We propose using LLM-based comparative ranking (via pairwise comparisons and a Bradley–Terry model) to identify a borderline band before human reviewing and to allocate marginal reviewer capacity at assignment time. Concretely, given a venue-specific minimum review target (e.g., 3 or 4), we use this signal to decide which papers receive one additional review (e.g., a 4th or 5th), without conditioning on any human reviews and without using LLM outputs for accept/reject. We provide a simple expected-impact calculation in terms of (i) the overlap between the predicted and true borderline sets ($\rho$) and (ii) the incremental value of an extra review near the boundary ($\Delta$), and we provide retrospective proxies to estimate these quantities.
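The pipeline described in the abstract can be sketched in a few lines: fit Bradley–Terry strengths from pairwise outcomes, take a band of ranks around the acceptance cutoff, and give papers in that band one extra review. This is a minimal illustration, not the authors' implementation. The function names, the smoothing term `prior`, the way the band is centered, and the toy comparisons are all assumptions made here; the fit uses the standard minorization-maximization (Zermelo) update for the Bradley–Terry model.

```python
def fit_bradley_terry(n_items, comparisons, iters=500, prior=0.1):
    """Fit Bradley-Terry strengths with the standard MM (Zermelo) update.

    comparisons: list of (winner, loser) index pairs, e.g. from an LLM judge.
    prior: pseudo-count added in both directions to every observed pair
           (an assumption of this sketch, to keep strengths finite).
    """
    wins = [[0.0] * n_items for _ in range(n_items)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0
    for i in range(n_items):
        for j in range(i + 1, n_items):
            if wins[i][j] + wins[j][i] > 0:
                wins[i][j] += prior
                wins[j][i] += prior

    p = [1.0] * n_items
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_items) if j != i)
            # Items never compared keep their current strength.
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x * n_items / norm for x in new_p]  # rescale so the mean is 1
    return p


def borderline_band(scores, accept_rate, band_width):
    """Indices of papers whose rank lies in a band centered on the cutoff.

    accept_rate and band_width are fractions of all papers (hypothetical knobs).
    """
    n = len(scores)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    cutoff = round(accept_rate * n)       # papers ranked above the cutoff
    half = round(band_width * n / 2)      # half-width of the band, in ranks
    return set(ranked[max(0, cutoff - half):min(n, cutoff + half)])


def review_targets(n_papers, band, min_reviews=3):
    """One extra review for papers in the band, the minimum for all others."""
    return [min_reviews + 1 if i in band else min_reviews for i in range(n_papers)]


# Toy data: six papers, and the lower-indexed paper wins every comparison.
comparisons = [(i, j) for i in range(6) for j in range(i + 1, 6)]
scores = fit_bradley_terry(6, comparisons)
band = borderline_band(scores, accept_rate=0.5, band_width=0.34)
targets = review_targets(6, band, min_reviews=3)
```

In this toy setup the band contains the two papers ranked immediately on either side of the cutoff, and only those two receive a fourth review. Nothing here touches the accept/reject decision, which matches the abstract's constraint that LLM outputs only influence how many reviews a paper gets.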

Paper Structure (23 sections, 8 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Schematic of the marginal review allocation pipeline. Step 1 uses LLM pairwise comparisons to construct a comparative ranking and identify a borderline band around the acceptance percentile. Step 2 allocates marginal reviews to that band while keeping total reviewer load fixed, with numbers shown for illustration.
  • Figure 2: Sensitivity to marginal reviewer fraction.
  • Figure 3: Sensitivity to centering and marginal review value.
  • Figure 4: Toy example of acceptance probability as a function of underlying quality. The black dotted line represents the jump in acceptance probability from 0 to 1 when quality exceeds a threshold (the target quantile). This is the first-best behavior, achievable only if the underlying quality were known. Reviewer feedback is a noisy signal of quality, which yields the black curve: exceeding the threshold no longer guarantees acceptance, but higher quality still leads to a higher acceptance probability. The red curve represents the acceptance probability under our scheme when the LLM has extremely high fidelity. In that case the curve is not monotone, so an author who could lower their paper's quality might do so deliberately to increase its acceptance probability. When the LLM has moderate noise, however, the acceptance probability follows the blue curve, which is monotone, rendering such gaming impossible.
  • Figure 5: Ablations on information access and judge capability.