RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows

Hamed Mahdavi, Pouria Mahdavinia, Samira Malek, Pegah Mohammadipour, Alireza Hashemi, Majid Daliri, Alireza Farhadi, Amir Khasahmadi, Niloofar Mireshghallah, Vasant Honavar

TL;DR

The paper tackles automated grading of mathematical proofs produced by LLMs, focusing on reliable partial-credit evaluation rather than binary correctness. It introduces RefGrader, an agentic workflow that extracts reference solutions and induces problem-specific rubrics to enable multi-step grading, and it demonstrates substantial gains over single-turn grading on IMO Shortlist and MathArena data. Key findings show that supplying reference solutions and designing rubrics carefully both improve partial-credit calibration and agreement with human judges, though design choices (e.g., milestone vs. approachability) interact nontrivially. The work has practical implications for education and automated proof evaluation, and the authors release data, code, and prompts to support future research and extensions.

Abstract

State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.
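
To make "agreement with human grades" concrete, the sketch below compares a list of human grades with a list of model grades on an integer scale such as the 0-7 MathArena scale. It is a minimal illustration, not the paper's evaluation code: the function name `agreement_metrics`, the choice of exact-match rate and mean absolute error, and the example grades are all assumptions made here. The row-normalized confusion matrix mirrors the kind of plot named in Figure 3, where each row (one human grade) sums to 1.

```python
from collections import Counter


def agreement_metrics(human, model, lo=0, hi=7):
    """Compare model grades with human grades on an integer scale [lo, hi].

    Hypothetical helper for illustration; the metrics are assumptions,
    not necessarily the ones reported in the paper.
    """
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty grade lists of equal length")

    n = len(human)
    pairs = list(zip(human, model))

    # Fraction of solutions where the model gives exactly the human grade.
    exact = sum(h == m for h, m in pairs) / n

    # Average distance between grades; penalizes large partial-credit misses more.
    mae = sum(abs(h - m) for h, m in pairs) / n

    # Row-normalized confusion matrix: rows are human grades, columns are
    # model grades, and each non-empty row sums to 1.
    labels = list(range(lo, hi + 1))
    counts = Counter(pairs)
    confusion = []
    for h in labels:
        row_total = sum(counts[(h, m)] for m in labels)
        confusion.append(
            [counts[(h, m)] / row_total if row_total else 0.0 for m in labels]
        )

    return {"exact": exact, "mae": mae, "confusion": confusion}


# Made-up grades for six solutions (not data from the paper).
human = [7, 0, 3, 7, 1, 0]
model = [7, 1, 5, 7, 1, 0]
result = agreement_metrics(human, model)
print(f"exact match: {result['exact']:.3f}, MAE: {result['mae']:.3f}")
```

In this toy example the model agrees exactly on four of six solutions, and the off-diagonal mass in the confusion matrix shows where partial credit drifts: a human grade of 3 graded as 5 counts as a miss for exact match but contributes only 2 points to the absolute error.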

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Dataset summaries and error analysis for the IMO Shortlist dataset
  • Figure 2: Grade distribution for the MathArena dataset
  • Figure 3: Normalized confusion matrices for single-turn grading on MathArena and IMO Shortlist.
  • Figure 4: The high-level schema of our multi-stage grading workflow
  • Figure 5: Workflow: reference solution clustering, solution matching, and grading.
  • ...and 2 more figures