Table of Contents
Fetching ...

QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu

TL;DR

QEDBench is introduced, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria, and a critical reasoning gap in the discrete domain is uncovered.

Abstract

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix (7 judges x 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (up to +0.18, +0.20, +0.30, +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models like GPT-5 Pro and Claude Sonnet 4.5 see their performance significantly degrade in discrete domains. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we also release QEDBench as a public benchmark for evaluating and improving AI judges. Our benchmark is publicly published at https://github.com/qqliu/Yale-QEDBench.

QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

TL;DR

QEDBench is introduced, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria, and a critical reasoning gap in the discrete domain is uncovered.

Abstract

As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-undergraduate to early graduate level mathematics. To quantify this, we introduce QEDBench, the first large-scale dual-rubric alignment benchmark to systematically measure alignment with human experts on university-level math proofs by contrasting course-specific rubrics against expert common knowledge criteria. By deploying a dual-evaluation matrix (7 judges x 5 solvers) against 1,000+ hours of human evaluation, we reveal that certain frontier evaluators like Claude Opus 4.5, DeepSeek-V3, Qwen 2.5 Max, and Llama 4 Maverick exhibit significant positive bias (up to +0.18, +0.20, +0.30, +0.36 mean score inflation, respectively). Furthermore, we uncover a critical reasoning gap in the discrete domain: while Gemini 3.0 Pro achieves state-of-the-art performance (0.91 average human evaluation score), other reasoning models like GPT-5 Pro and Claude Sonnet 4.5 see their performance significantly degrade in discrete domains. Specifically, their average human evaluation scores drop to 0.72 and 0.63 in Discrete Math, and to 0.74 and 0.50 in Graph Theory. In addition to these research results, we also release QEDBench as a public benchmark for evaluating and improving AI judges. Our benchmark is publicly published at https://github.com/qqliu/Yale-QEDBench.
Paper Structure (74 sections, 21 figures, 4 tables)

This paper contains 74 sections, 21 figures, 4 tables.

Figures (21)

  • Figure 1: A side-by-side comparison of the 0.75 (Small Mistake) tier for the "Graph Expansion" problem. While both rubrics penalize similar errors, the Expert Rubric focuses on implicit logical gaps (e.g., monotonicity), while the Course Rubric focuses on explicit pedagogical derivations (e.g., deriving constraints).
  • Figure 2: Pass Rates by Discipline. The pass rates (score $\ge 0.9$) of frontier models across various mathematical disciplines. We observe that while models achieve high reliability in calculation-heavy domains like ODEs and Probability, performance drops significantly in structure-heavy domains such as Combinatorics and Graph Theory.
  • Figure 3: Average Score Distribution. The average evaluation scores (0.0--1.0) assigned by expert judges. Unlike the binary pass rate, this metric accounts for partial credit, revealing that models often demonstrate strong conceptual understanding (high partial scores) in domains like Analysis even when failing to produce fully rigorous proofs.
  • Figure 4: Evaluator Strictness. Comparison of pass rates ($\text{score} \ge 0.9$) across judges. While DeepSeek-V3 (72.2%) aligns most closely with the Human Consensus (67.7%), Llama 4 Maverick (90.2%) exhibits significant grade inflation.
  • Figure 5: Evaluator Bias Heatmap. The delta between AI and Human scores ($\Delta = S_{AI} - S_{Human}$). Positive values (red) denote score inflation (AI is more lenient), while negative values (blue) signify punitive bias (Human is more lenient). These systemic deltas highlight a bidirectional alignment gap: models tend to be more lenient in discrete domains while being more punitive in continuous analysis.
  • ...and 16 more figures