Artificial-Intelligence Grading Assistance for Handwritten Components of a Calculus Exam

Gerd Kortemeyer, Alexander Caspar, Daria Horica

TL;DR

This study assesses AI-assisted grading of handwritten open-ended calculus within a human-in-the-loop framework. Using GPT-5 to score page-by-page student work against a TA rubric, the authors implement a partial-credit threshold and a 2PL IRT-based risk filter to decide auto-acceptance versus human review, revealing moderate unfiltered AI-TA agreement ($R^2\approx0.85$) and substantial gains in alignment when confidence filters are applied ($R^2$ rising toward $0.95$) at the cost of higher manual grading. The results show that production-ready automation can handle a sizable subset of routine items (up to ~81% auto-acceptance under permissive settings, ~30% under tighter settings), with key psychometric constraints stemming from low stakes on open-ended tasks and limited rubric granularity. Practical recommendations include increasing observable rubric checkpoints, improving spatial anchoring and page-layout design, and adjusting weighting to raise the ceiling of AI-assisted grading while preserving the integrity of students' mathematical reasoning. Overall, calibrated confidence and conservative routing enable scalable AI guidance in authentic calculus assessment, reserving expert judgment for pedagogically rich or ambiguous responses.

Abstract

We investigate whether contemporary multimodal LLMs can assist with grading open-ended calculus at scale without eroding validity. In a large first-year exam, students' handwritten work was graded by GPT-5 against the same rubric used by teaching assistants (TAs), with fractional credit permitted; TA rubric decisions served as ground truth. We calibrated a human-in-the-loop filter that combines a partial-credit threshold with an Item Response Theory (2PL) risk measure based on the deviation between the AI score and the model-expected score for each student-item. Unfiltered AI-TA agreement was moderate, adequate for low-stakes feedback but not for high-stakes use. Confidence filtering made the workload-quality trade-off explicit: under stricter settings, AI delivered human-level accuracy, but also left roughly 70% of the items to be graded by humans. Psychometric patterns were constrained by low stakes on the open-ended portion, a small set of rubric checkpoints, and occasional misalignment between designated answer regions and where work appeared. Practical adjustments, such as slightly higher weight and protected time, a few rubric-visible substeps, and stronger spatial anchoring, should raise ceiling performance. Overall, calibrated confidence and conservative routing enable AI to reliably handle a sizable subset of routine cases while reserving expert judgment for ambiguous or pedagogically rich responses.
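
The routing idea described in the abstract can be illustrated with a minimal sketch. The abstract specifies only that the filter combines a partial-credit threshold with a 2PL risk measure based on the deviation between the AI score and the model-expected score; everything else below is an assumption made for illustration. That includes the function names, the threshold values, the use of the absolute deviation as the risk, and the reading of the partial-credit threshold as "fractional scores far from 0 or 1 are treated as uncertain". It is not the authors' implementation.

```python
import math


def expected_score(theta: float, a: float, b: float) -> float:
    """Standard 2PL item characteristic curve.

    theta: student ability, a: item discrimination, b: item difficulty.
    Returns the model-expected probability of earning the rubric item.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def route(ai_score: float, theta: float, a: float, b: float,
          partial_threshold: float = 0.2,   # hypothetical value
          risk_threshold: float = 0.5) -> str:  # hypothetical value
    """Decide whether an AI-assigned score (in [0, 1]) is auto-accepted.

    Assumed rule: accept only if the score is close to 0 or 1 AND it does
    not deviate too far from what the 2PL model expects for this student.
    """
    # Distance of the fractional credit from the nearest extreme (0 or 1).
    distance_to_extreme = min(ai_score, 1.0 - ai_score)
    # Assumed risk measure: absolute deviation from the 2PL expectation.
    risk = abs(ai_score - expected_score(theta, a, b))
    if distance_to_extreme <= partial_threshold and risk <= risk_threshold:
        return "auto-accept"
    return "human-review"


if __name__ == "__main__":
    # Strong student, full credit: consistent with the model.
    print(route(1.0, theta=2.0, a=1.5, b=0.0))
    # Ambiguous half credit: sent to a human regardless of risk.
    print(route(0.5, theta=2.0, a=1.5, b=0.0))
    # Weak student, full credit: surprising under the model.
    print(route(1.0, theta=-2.0, a=1.5, b=0.0))
```

Tightening either threshold in this sketch sends more items to human review, which mirrors the workload-quality trade-off the abstract reports.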

Paper Structure

This paper contains 17 sections, 2 equations, 9 figures.

Figures (9)

  • Figure 1: An example of graded student work in the answer booklet.
  • Figure 2: An example of the input for the AI system; grading marks were removed and two pages combined into one image. Potentially identifying information was redacted here for publication purposes (dark blue boxes).
  • Figure 3: An example of the grading rubric, provided in the same format to the TAs and the AI.
  • Figure 4: Total AI-assigned versus total TA-assigned score. Each data point represents one exam.
  • Figure 5: Graphs of the 2PL IRT function (item characteristic curves) based on the AI grading (left panel), which will be used for the confidence filtering, and on the TA grading (right panel), given for comparison.
  • ...and 4 more figures