Assisting the Grading of a Handwritten General Chemistry Exam with Artificial Intelligence
Jan Cvengros, Gerd Kortemeyer
TL;DR
The paper investigates AI-assisted grading of handwritten general chemistry exams by comparing AI scores to TA-derived ground truth using psychometric analyses. It demonstrates strong agreement for textual and chemical-reaction questions but weaker reliability for numerical and graphical tasks, prompting a human-in-the-loop approach. Confidence-filtering strategies based on partial-credit thresholds and Bayesian risk, plus problem-type filtering, substantially improve alignment between AI and human grading, especially when totals are aggregated. The findings suggest a practical, layered workflow where AI handles routine items and humans focus on ambiguous or graphically intensive responses, with attention to transparency and student perceptions of fairness. Overall, AI grading can reduce workload while preserving grading quality when deployed with careful calibration and explicit communication.
Abstract
We explore the effectiveness and reliability of an artificial intelligence (AI)-based grading system for a handwritten general chemistry exam, comparing AI-assigned scores to human grading across various types of questions. Exam pages and grading rubrics were uploaded as images to account for chemical reaction equations, short and long open-ended answers, numerical and symbolic answer derivations, drawing, and sketching in pencil-and-paper format. Using linear regression analyses and psychometric evaluations, the investigation reveals high agreement between AI and human graders for textual and chemical reaction questions, while highlighting lower reliability for numerical and graphical tasks. The findings emphasize the need for human oversight, guided by selective filtering, to ensure grading accuracy. The results indicate promising applications for AI in routine assessment tasks, though careful consideration must be given to student perceptions of fairness and trust when integrating AI-based grading into educational practice.
