Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark

Zhiqi Yu, Xingping Liu, Haobin Mao, Mingshuo Liu, Long Chen, Jack Xin, Yifeng Yu

TL;DR

The paper outlines a standardized benchmark for AI grading of handwritten mathematics, intended to support reproducible comparison and future research, and introduces a multi-perspective evaluation protocol for reliable real-course deployment.

Abstract

Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.

Paper Structure (47 sections, 28 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Overview of the proposed AI grading pipeline: The core module converts handwritten solutions to LaTeX and evaluates them against reference solutions and structured rubrics, producing detailed scores and feedback.
  • Figure 2: Within each group, bars correspond (left to right) to the flexible rubric, the fixed rubric, and the max-rule. Panels: (A) Math 2A, (B) Math 2B.
  • Figure 3: Global distribution of AI–TA score gaps.
  • Figure 4: Quiz-level summary: mean gap and within-1 percentage.
  • Figure 5: OCR verdict distribution. Panels: (A) Math 2A, (B) Math 2B.
  • ...and 5 more figures

Theorems & Definitions (8)

  • Example 1
  • Example 2
  • Example 3
  • Example 4
  • Example 5
  • Example 6
  • Example 7
  • Example 8