Cross-Model Disagreement as a Label-Free Correctness Signal

Matt Gorbett, Suman Jana

Abstract

Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
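
As a concrete illustration of the single-forward-pass recipe described above, the listing below computes CMP and CME for one (prompt, answer) pair using a generic Hugging Face causal LM as the verifier. The checkpoint name, prompt formatting, and token-level aggregation are illustrative assumptions, not the paper's released implementation.

    # Minimal sketch: Cross-Model Perplexity (CMP) and Cross-Model Entropy (CME)
    # from a single verifier forward pass. Checkpoint, prompting, and aggregation
    # are assumptions for illustration, not the authors' exact code.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    VERIFIER = "Qwen/Qwen2.5-7B"  # assumed verifier checkpoint
    tok = AutoTokenizer.from_pretrained(VERIFIER)
    verifier = AutoModelForCausalLM.from_pretrained(VERIFIER, torch_dtype=torch.bfloat16)
    verifier.eval()

    @torch.no_grad()
    def cross_model_scores(prompt: str, answer: str):
        """CMP = exp(-mean log p_V(y_t | x, y_<t)) over the answer tokens;
        CME = mean next-token entropy of the verifier at those positions."""
        prompt_ids = tok(prompt, return_tensors="pt").input_ids
        answer_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)

        logits = verifier(input_ids).logits                       # [1, T, vocab]
        start = prompt_ids.shape[1] - 1                           # positions that predict answer tokens
        log_probs = torch.log_softmax(logits[0, start:-1].float(), dim=-1)

        token_logp = log_probs.gather(1, answer_ids[0].unsqueeze(1)).squeeze(1)
        cmp_score = torch.exp(-token_logp.mean()).item()          # high CMP = verifier is surprised
        cme_score = (-(log_probs.exp() * log_probs).sum(-1)).mean().item()
        return cmp_score, cme_score

    # Usage: flag the generator's answer as a likely error when CMP (or CME)
    # exceeds a threshold tuned on held-out data.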

Paper Structure

This paper contains 28 sections, 3 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Cross-model disagreement as a label-free correctness indicator. Given a prompt $x$, the generator (Llama-3-8B) produces an answer $\hat{y}$. The verifier (Qwen2.5-7B) performs a single forward pass over $(x, \hat{y})$ with no generation required. The verifier assigns low probability to the token "1942", the generator's confident but incorrect answer. Cross-model perplexity (CMP) aggregates this surprise signal into a single correctness indicator, and a high CMP flags a likely error.
  • Figure 2: CMP and CME performance across datasets compared to baselines. (A) Mean AUROC over all model pairs. G-Ent and G-PPL measure the generator's own entropy and perplexity; CME and CMP measure the corresponding signals from a verifier model on the generator's answer. APGR measures the fraction of the performance gap between the weak and strong models that is recovered by routing, normalized so that random routing scores 0 and oracle routing scores 1; pairs with small gaps are excluded because the small denominator in the PGR formula produces unstable estimates. Error bars show standard error across pairs; $n$ indicates the number of pairs per dataset.
  • Figure 3: Top row: per-case signal means. Mean CMP (left axis, blue) and G-Ent (right axis, red) by outcome category. The shaded column highlights the "generator wrong only" case---confident errors the verifier does not share. On MMLU, CMP spikes $9\times$ above the mean of the other three cases while G-Ent is flat ($1.0\times$); on TriviaQA the spike is $10\times$ vs. $1.6\times$; on GSM8K both signals rise modestly ($2\times$ and $1.4\times$), reflecting the difficulty of isolating chain-of-thought errors with token-level signals. Bottom row: accuracy by signal quintile. Samples are sorted by signal strength (Q1 = lowest, Q5 = highest); bars show weak-model accuracy within each bin. On MMLU, CMP produces a 74pp spread versus 24pp for G-Ent and 20pp for G-PPL. On TriviaQA (no context), all three signals are competitive (94pp, 81pp, 80pp). On GSM8K, CMP achieves a 53pp spread while G-PPL nearly collapses to 4pp, confirming that generator self-perplexity is uninformative on chain-of-thought tasks and that cross-model disagreement is doing genuine work.
  • Figure 4: AUROC versus capability gap across three benchmarks. On MMLU, AUROC is uncorrelated with gap ($\rho = +0.11$, $p = 0.72$) and same-size cross-family pairs (blue) achieve the highest scores, suggesting model diversity drives the signal rather than capability asymmetry. On TriviaQA, gap correlates positively with AUROC ($\rho = +0.60$, $p = 0.04$), indicating a stronger verifier helps when errors are knowledge-driven. GSM8K shows no significant trend ($\rho = -0.11$, $p = 0.40$).
  • Figure 5: Coverage--accuracy curves (single pair). At each coverage level we abstain on instances whose signal score exceeds a swept threshold and report generator accuracy on the retained subset (a sketch of this sweep appears after the figure list). CMP (blue, solid) maintains higher accuracy than G-Ent (red, dashed) across nearly all operating points on both MMLU (Qwen-0.5B $\to$ Qwen-7B) and GSM8K (Mistral-7B $\to$ Llama-3-8B). The dotted line marks full-set accuracy.
  • ...and 3 more figures
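
The analyses in Figures 3 and 5 both reduce to sorting examples by the disagreement signal and measuring generator accuracy on subsets: quintile bins in Figure 3 and a swept abstention threshold in Figure 5. A minimal sketch of these two evaluation procedures, assuming per-example signal scores and 0/1 correctness labels on a held-out evaluation set (the function names and threshold grid are illustrative, not the paper's code), is:

    # Sketch of the quintile-accuracy and coverage--accuracy analyses, assuming
    # `signal` (e.g. CMP per example) and `correct` (0/1 generator correctness).
    import numpy as np

    def accuracy_by_quintile(signal, correct):
        """Generator accuracy within each signal quintile (Q1 = lowest signal)."""
        signal, correct = np.asarray(signal, float), np.asarray(correct, float)
        edges = np.quantile(signal, [0.2, 0.4, 0.6, 0.8])
        bins = np.digitize(signal, edges)                  # bin indices 0..4
        return [correct[bins == q].mean() for q in range(5)]

    def coverage_accuracy_curve(signal, correct, num_points=100):
        """At each coverage level, abstain on the highest-signal examples and
        report accuracy on the retained (most trusted) subset."""
        signal, correct = np.asarray(signal, float), np.asarray(correct, float)
        sorted_correct = correct[np.argsort(signal)]       # most trusted first
        ks = np.linspace(1, len(signal), num_points).astype(int)
        coverage = ks / len(signal)
        accuracy = np.array([sorted_correct[:k].mean() for k in ks])
        return coverage, accuracy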