Evaluating AI Grading on Real-World Handwritten College Mathematics: A Large-Scale Study Toward a Benchmark

Zhiqi Yu; Xingping Liu; Haobin Mao; Mingshuo Liu; Long Chen; Jack Xin; Yifeng Yu

TL;DR

The paper outlines a standardized benchmark for AI grading of handwritten mathematics, intended to support reproducible comparison and future research, and introduces a multi-perspective evaluation protocol for reliable deployment in real courses.

Abstract

Grading in large undergraduate STEM courses often yields minimal feedback due to heavy instructional workloads. We present a large-scale empirical study of AI grading on real, handwritten single-variable calculus work from UC Irvine. Using OCR-conditioned large language models with structured, rubric-guided prompting, our system produces scores and formative feedback for thousands of free-response quiz submissions from nearly 800 students. In a setting with no single ground-truth label, we evaluate performance against official teaching-assistant grades, student surveys, and independent human review, finding strong alignment with TA scoring and a large majority of AI-generated feedback rated as correct or acceptable across quizzes. Beyond calculus, this setting highlights core challenges in OCR-conditioned mathematical reasoning and partial-credit assessment. We analyze key failure modes, propose practical rubric- and prompt-design principles, and introduce a multi-perspective evaluation protocol for reliable, real-course deployment. Building on the dataset and evaluation framework developed here, we outline a standardized benchmark for AI grading of handwritten mathematics to support reproducible comparison and future research.
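The comparison against TA grades described above can be illustrated with a minimal sketch. It assumes a "gap" is the AI score minus the TA score on the same submission, and that "within-1" means an absolute gap of at most one point; these definitions, the function name `agreement_summary`, and the sample scores are illustrative assumptions, not the paper's implementation or data.

```python
from statistics import mean


def agreement_summary(ai_scores, ta_scores, tolerance=1.0):
    """Summarize AI-vs-TA agreement for one quiz (hypothetical helper).

    Assumes gap = AI score - TA score per submission; the paper may
    define the gap differently.
    """
    if len(ai_scores) != len(ta_scores) or not ai_scores:
        raise ValueError("score lists must be non-empty and of equal length")

    # Signed gap per submission: positive means the AI graded higher.
    gaps = [ai - ta for ai, ta in zip(ai_scores, ta_scores)]

    # Count submissions whose absolute gap is within the tolerance.
    within = sum(abs(g) <= tolerance for g in gaps)

    return {
        "mean_gap": mean(gaps),
        "mean_abs_gap": mean(abs(g) for g in gaps),
        "within_tolerance_pct": 100.0 * within / len(gaps),
    }


# Made-up scores for four submissions, for illustration only.
ai = [8, 6, 10, 4]
ta = [8, 7, 9, 7]
summary = agreement_summary(ai, ta)
print(summary)
```

Reporting the signed mean gap alongside the absolute one separates systematic leniency or harshness from overall disagreement, since positive and negative gaps cancel in the signed mean.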

Paper Structure (47 sections, 28 equations, 10 figures, 5 tables)

Introduction and Motivation
Overview of the AI Grading Pipeline
Model choice and evolving capabilities.
OCR Component
Grading and Structured Prompt Engineering
System Message Design
Rubric and Prompt Framework
o3-mini vs. GPT-4.1-mini Grading Variance and Targeted Accuracy Evaluation
Evaluation Against Human Grading and Feedback
Alignment with TA Scores
Overall agreement (global)
Quiz-level summary
Survey Result From Students
Agreement Between AI and Independent Human Reviewer
Input Quality (OCR).
...and 32 more sections

Figures (10)

Figure 1: Overview of the proposed AI grading pipeline: The core module converts handwritten solutions to LaTeX and evaluates them against reference solutions and structured rubrics, producing detailed scores and feedback.
Figure 2: Bars correspond to the flexible rubric, fixed rubric, and max-rule, respectively (left to right within each group): (A) Math 2A; (B) Math 2B.
Figure 3: Global distribution of AI–TA score gaps.
Figure 4: Quiz-level summary: mean gap and within-1 percentage.
Figure 5: OCR verdict distribution: (A) Math 2A; (B) Math 2B.
...and 5 more figures

Theorems & Definitions (8)

Example 1
Example 2
Example 3
Example 4
Example 5
Example 6
Example 7
Example 8
