
SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Sher Badshah, Ali Emami, Hassan Sajjad

TL;DR

The paper addresses the reliability gap in LLM-based pairwise judging by introducing SCOPE, a selective conformal framework that guarantees, under exchangeability, that the error rate among accepted judgments does not exceed a user-specified level $\alpha$. It pairs this with Bidirectional Preference Entropy (BPE), which derives bias-neutral uncertainty scores by evaluating each pair in both orders and aggregating the results into a permutation-invariant measure. Empirically, SCOPE achieves valid risk control with substantially higher coverage than naïve baselines across MT-Bench, RewardBench, and Chatbot Arena, using judge models from $7$B to $70$B parameters; BPE improves calibration and discrimination over standard confidence proxies and Simulated Annotators. The work shows that combining bias-aware uncertainty estimation with conformal risk control yields reliable, scalable LLM-based evaluation, offering a principled path toward trustworthy automated benchmarking and alignment workflows, with potential extensions to richer evaluation settings.

Abstract

Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
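
The BPE computation described in the abstract can be illustrated with a minimal sketch. It assumes the two judge calls have already been mapped to the probability that response A is preferred (once with A shown first, once with A shown second), that the aggregation is a simple average, and that the entropy is binary entropy in bits. The function names `binary_entropy` and `bpe_score` are hypothetical, and the paper's exact aggregation and log base may differ.

```python
import math
from typing import Tuple


def binary_entropy(p: float) -> float:
    """Binary entropy in bits: 0 when p is 0 or 1, and 1 when p is 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def bpe_score(p_a_forward: float, p_a_reverse: float) -> Tuple[float, float]:
    """Return (aggregated preference for A, entropy-based uncertainty).

    p_a_forward: P(A preferred) when A is shown in the first position.
    p_a_reverse: P(A preferred) when A is shown in the second position.
    Averaging is an assumption made for this sketch.
    """
    p_bar = 0.5 * (p_a_forward + p_a_reverse)
    return p_bar, binary_entropy(p_bar)


# A judge that flips with position (0.9 vs. 0.3) yields p_bar near 0.6
# and a high uncertainty score, so the pair is a candidate for abstention.
p_bar, uncertainty = bpe_score(0.9, 0.3)
```

Because the average does not depend on which order was queried first, and binary entropy is symmetric around 0.5, the score is unchanged when the two orders or the two response labels are swapped.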

Paper Structure

This paper contains 43 sections, 1 theorem, 24 equations, 6 figures, and 3 tables.

Key Result

Theorem 2.1

Let the calibration and test samples be exchangeable (Angelopoulos and Bates, 2023). For any $\alpha \in (0, 1)$, the threshold $\hat{\lambda}$ obtained from the SCOPE threshold optimization guarantees that the marginal test-time false discovery rate (FDR) is bounded by the target level, $\mathbb{E}[\mathrm{FDR}(\hat{\lambda})] \le \alpha$, where the expectation is taken over the joint randomness of the calibration set and the test sample.
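
To make the role of $\hat{\lambda}$ concrete, here is a minimal sketch of a threshold calibration on labeled data. It does not reproduce the paper's optimization. It assumes a common conservative rule: accept all calibration points with uncertainty at most $\lambda$, add one pseudo-error to the accepted error count, and pick the largest $\lambda$ whose corrected error rate stays at or below $\alpha$. The function name `calibrate_threshold` and the "+1" correction are assumptions of this sketch, and the guarantee of Theorem 2.1 applies to the paper's rule, not necessarily to this one.

```python
import numpy as np


def calibrate_threshold(scores, errors, alpha):
    """Largest threshold whose conservative accepted-error rate is <= alpha.

    scores: uncertainty score per calibration pair (lower = more confident).
    errors: 1 if the judge disagreed with the human label, else 0.
    Returns -inf when no threshold is feasible (abstain on everything).
    """
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(errors, dtype=float)
    if scores.size == 0:
        return float("-inf")

    order = np.argsort(scores, kind="stable")
    s_sorted, e_sorted = scores[order], errors[order]

    # Accepting the k most confident points gives (errors + 1) / k as a
    # conservative estimate of the error rate among accepted judgments.
    n_accepted = np.arange(1, s_sorted.size + 1)
    risk_bound = (np.cumsum(e_sorted) + 1.0) / n_accepted

    # A threshold equal to a tied score accepts the whole tie group,
    # so only the last element of each group is a valid cut point.
    is_last_of_tie = np.append(s_sorted[1:] != s_sorted[:-1], True)

    feasible = np.nonzero((risk_bound <= alpha) & is_last_of_tie)[0]
    if feasible.size == 0:
        return float("-inf")
    return float(s_sorted[feasible.max()])
```

At test time a judgment would be accepted when its uncertainty score is at most the returned threshold, and abstained on otherwise.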

Figures (6)

  • Figure 1: Overview of the SCOPE Framework. (1) Pairwise Judging: An LLM judge evaluates two responses ($r_A, r_B$) for a user query $q$. (2) Bidirectional Preference Entropy (BPE): To neutralize position bias, the judge evaluates the pair in both forward and reverse orders. The probabilities are aggregated into a bias-neutral preference $\bar{p}$ and converted into an entropy-based uncertainty score $s(x)$. (3) SCOPE: The user sets a target risk level $\alpha$ (e.g., 0.10). Using conformal calibration on labeled data, the system calculates an optimized threshold $\hat{\lambda}$. If the uncertainty $s(x) \leq \hat{\lambda}$, the judgment is accepted with a statistical guarantee that the error rate is controlled at $\alpha$.
  • Figure 2: Coverage vs. target risk level $\alpha$ for Scope. Coverage increases as the risk budget is relaxed, and larger judges sustain higher coverage at strict tolerances.
  • Figure 3: Statistical validity of Scope across benchmarks. We report the empirical risk (FDR) against the user-specified target risk level $\alpha$. The dashed diagonal line ($y=x$) indicates the theoretical safety limit; curves remaining below this boundary demonstrate valid risk control. Solid lines represent the mean risk over $1000$ trials, while shaded regions denote the standard deviation ($\pm 1\sigma$). Scope consistently satisfies the risk constraint across judges and tasks.
  • Figure 4: Pairwise evaluation prompt. The system instruction used for all judge models. Note that instructions requesting an explanation/reasoning trace were removed to enable direct logit extraction (or greedy decoding) of the preference token.
  • Figure 5: Verbalized confidence prompt. As detailed in Appendix B.3.3, the instruction "Provide a score between 0.0 (total guess) and 1.0 (absolute certainty)" is appended to the standard pairwise evaluation prompt to elicit a numerical confidence estimate.
  • ...and 1 more figure
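
The two quantities plotted in Figures 2 and 3, coverage and empirical risk, can be sketched as follows for a fixed threshold. This is an illustrative helper, not the authors' evaluation code. The name `selective_metrics` is hypothetical, and reporting a risk of 0 when nothing is accepted is a convention assumed here.

```python
import numpy as np


def selective_metrics(scores, errors, lam):
    """Return (coverage, empirical risk) for acceptance rule score <= lam.

    coverage: fraction of test pairs whose judgment is accepted.
    risk: fraction of accepted judgments that are wrong (empirical FDR);
          defined as 0.0 when nothing is accepted (assumed convention).
    """
    scores = np.asarray(scores, dtype=float)
    errors = np.asarray(errors, dtype=float)
    accepted = scores <= lam
    coverage = float(accepted.mean()) if scores.size else 0.0
    risk = float(errors[accepted].mean()) if accepted.any() else 0.0
    return coverage, risk


# Two of four pairs fall under the threshold, and one of those is wrong.
coverage, risk = selective_metrics([0.1, 0.2, 0.6, 0.9], [0, 1, 0, 1], lam=0.5)
```

Repeating this over many random calibration/test splits and averaging the risk would give curves of the kind described for Figure 3, where staying below the $y=x$ diagonal indicates valid risk control.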

Theorems & Definitions (2)

  • Theorem 2.1
  • proof