Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams
Adriana Caraeni, Alexander Scarlatos, Andrew Lan
TL;DR
The paper investigates automated grading of handwritten mathematics using GPT-4o, addressing the multimodal challenge of combining handwriting, diagrams, and mathematical reasoning. It uses real exam data from 18 students in a probability theory course and evaluates three prompting regimes: N (no context), C (with correct answers), and CR (with correct answers and rubrics). The evaluation covers 90 question-samples, with scores normalized to a 0–1 scale. Results show that while including correct answers and rubrics improves alignment with human graders, GPT-4o's accuracy remains insufficient for real-world deployment, with notable variability across questions and clear failure modes in the justifications it gives for its scores. The work identifies avenues for improvement, including task decomposition, fine-tuning, and rubric redesign, to advance practical automated grading of handwritten work in STEM assessments.
Abstract
Recent advances in generative artificial intelligence (AI) have shown promise in accurately grading open-ended student responses. However, few prior works have explored grading handwritten responses due to a lack of data and the challenge of combining visual and textual information. In this work, we leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams. Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques. We find that while providing rubrics improves alignment, the model's overall accuracy is still too low for real-world settings, showing there is significant room for growth in this task.
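To make the evaluation setup concrete, the following is a minimal sketch of how scores on a 0–1 scale could be compared against human grades for each prompting regime. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, mean absolute error is used only as one plausible alignment measure (the text above does not say which measure the paper reports), and all numbers are made up to exercise the code.

```python
from statistics import mean


def normalize(points, max_points):
    """Map a raw score onto the 0-1 scale by dividing by the maximum."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    return points / max_points


def mean_absolute_error(human, model):
    """Average absolute gap between human and model scores (lower is better)."""
    if len(human) != len(model):
        raise ValueError("score lists must have the same length")
    return mean(abs(h - m) for h, m in zip(human, model))


# Made-up (points_awarded, max_points) pairs; NOT results from the paper.
human_raw = [(10, 10), (5, 10), (0, 4)]
model_raw = {
    "N": [(6, 10), (9, 10), (2, 4)],   # no context
    "C": [(8, 10), (7, 10), (1, 4)],   # with correct answers
    "CR": [(9, 10), (5, 10), (1, 4)],  # with correct answers and rubrics
}

human = [normalize(p, m) for p, m in human_raw]
errors = {
    regime: mean_absolute_error(human, [normalize(p, m) for p, m in raw])
    for regime, raw in model_raw.items()
}

for regime, err in errors.items():
    print(f"{regime}: mean absolute error = {err:.3f}")
```

Normalizing before comparison lets questions with different point totals contribute on the same scale, which is presumably why the scores are reported on a 0–1 range.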
