Human-in-the-Loop LLM Grading for Handwritten Mathematics Assessments

Arne Vanhoyweghen, Vincent Holst, Melika Mobini, Lukas Van de Voorde, Tibo Vanleke, Bert Verbruggen, Brecht Verbeken, Andres Algaba, Sam Verboven, Marie-Anne Guerry, Filip Van Droogenbroeck, Vincent Ginis

Abstract

Providing timely and individualised feedback on handwritten student work is highly beneficial for learning but difficult to achieve at scale. This challenge has become more pressing as generative AI undermines the reliability of take-home assessments, shifting emphasis toward supervised, in-class evaluation. We present a scalable, end-to-end workflow for LLM-assisted grading of short, pen-and-paper assessments. The workflow spans (1) constructing solution keys, (2) developing detailed rubric-style grading keys used to guide the LLM, and (3) a grading procedure that combines automated scanning and anonymisation, multi-pass LLM scoring, automated consistency checks, and mandatory human verification. We deploy the system in two undergraduate mathematics courses using six low-stakes in-class tests. Empirically, LLM assistance reduces grading time by approximately 23% while achieving agreement comparable to, and in several cases tighter than, fully manual grading. Occasional model errors occur but are effectively contained by the hybrid design. Overall, our results show that carefully embedded human-in-the-loop LLM grading can substantially reduce workload while maintaining fairness and accuracy.

Paper Structure (19 sections, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Example grading key highlighting two design issues: assigning too many points to a single step and using imprecise terminology (“factorise”), which can cause the model to award credit for mathematically irrelevant operations.
  • Figure 2: Quadratically weighted Cohen’s $\kappa$ values comparing grading agreement across media, tests, and question types. A1 and A2 denote the two human annotators (e.g., A1 vs A2 indicates inter-annotator agreement). Colours indicate the bonus test, while marker shapes denote the question type. Across all but one question–annotator pairing, annotator–LLM agreement in the digital condition is comparable to, and in several cases higher than, inter-annotator agreement under manual grading.
  • Figure 3: Distributions of absolute score deviations for all grader pairings under manual and digital grading. The top row shows manual grading comparisons (human vs. human, human vs. GPT using the median assigned score), while the bottom row shows digital grading comparisons. Dashed vertical lines indicate median deviations and solid lines indicate mean deviations. In the manual setting, mean and median deviations coincide for both human vs. human and human vs. GPT comparisons, indicating similar grading behaviour. In the digital setting, median deviations are lower than the mean for both human vs. human and human vs. GPT comparisons, consistent with an anchoring effect in which the LLM grade acts as a stabilising reference that reduces typical disagreement while leaving mean deviation largely unchanged.
  • Figure 4: Relative positioning of LLM-assigned grades with respect to the two human annotators. Bars indicate the proportion of responses for which the LLM score lies below both human scores, between the two human scores, is equal to both, or exceeds both. Error bars report mean $\pm$ standard deviation of absolute distance to the nearest human boundary. Across grading media, the LLM most frequently assigns intermediate or matching scores, with deviations outside the human range occurring less often.
  • Figure 5: Relative positioning of LLM-assigned grades with respect to the two human annotators when the maximum of five LLM evaluations is used as the provisional grade. Compared to median aggregation (main text), maximum-based aggregation produces more positive deviations outside the human score range.
  • ...and 4 more figures
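
The agreement metric of Figure 2 and the grade-positioning analysis of Figures 4 and 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the integer score scale, and the choice to count an LLM score that ties with only one annotator as "between" are all assumptions made for illustration.

```python
from statistics import median


def quadratic_weighted_kappa(a, b, n_levels):
    """Quadratically weighted Cohen's kappa for two lists of integer
    scores in 0..n_levels-1 (hypothetical helper, not from the paper)."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    if n_levels < 2:
        raise ValueError("need at least two score levels")
    n = len(a)
    # Joint counts of (rater A score, rater B score) and their marginals.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            num += weight * observed[i][j]
            den += weight * row[i] * col[j] / n
    # den is zero only when both raters use one identical score throughout;
    # kappa is then undefined and is reported as 1.0 here (an assumption).
    return 1.0 - num / den if den else 1.0


def position(llm_score, human_1, human_2):
    """Classify an LLM score relative to two human scores.
    Ties with only one annotator are counted as 'between' (assumption)."""
    low, high = min(human_1, human_2), max(human_1, human_2)
    if llm_score < low:
        return "below"
    if llm_score > high:
        return "above"
    if low == high:
        return "equal"
    return "between"


# Five hypothetical LLM passes over one response, aggregated two ways.
passes = [3, 3, 4, 2, 5]
median_grade = median(passes)  # median aggregation
max_grade = max(passes)        # maximum-based aggregation
```

With two annotators who scored this hypothetical response 3 and 4, `position(median_grade, 3, 4)` falls inside the human range while `position(max_grade, 3, 4)` lies above it, which is the kind of upward shift the Figure 5 caption attributes to maximum-based aggregation.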