Assisting the Grading of a Handwritten General Chemistry Exam with Artificial Intelligence

Jan Cvengros, Gerd Kortemeyer

TL;DR

The paper investigates AI-assisted grading of handwritten general chemistry exams by comparing AI scores to TA-derived ground truth using psychometric analyses. It demonstrates strong agreement for textual and reaction-question types but weaker reliability for numerical and graphical tasks, prompting a human-in-the-loop approach. Confidence-filtering strategies based on partial-credit thresholds and Bayesian risk, plus problem-type filtering, substantially improve alignment between AI and human grading, especially when totals are aggregated. The findings suggest a practical, layered workflow where AI handles routine items and humans focus on ambiguous or graphically intensive responses, with attention to transparency and student perceptions of fairness. Overall, AI grading can reduce workload while preserving grading quality when deployed with careful calibration and explicit communication.

Abstract

We explore the effectiveness and reliability of an artificial intelligence (AI)-based grading system for a handwritten general chemistry exam, comparing AI-assigned scores to human grading across various types of questions. Exam pages and grading rubrics were uploaded as images to account for chemical reaction equations, short and long open-ended answers, numerical and symbolic answer derivations, drawing, and sketching in pencil-and-paper format. Using linear regression analyses and psychometric evaluations, the investigation reveals high agreement between AI and human graders for textual and chemical reaction questions, while highlighting lower reliability for numerical and graphical tasks. The findings emphasize the necessity for human oversight to ensure grading accuracy, based on selective filtering. The results indicate promising applications for AI in routine assessment tasks, though careful consideration must be given to student perceptions of fairness and trust in integrating AI-based grading into educational practice.
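The comparison described in the abstract, a linear regression of AI scores against human scores followed by selective filtering, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the records are invented, the threshold values `LOW` and `HIGH` are arbitrary, and the rule "keep AI scores only when they are close to zero or full credit, and never for numerical or graphing items" is one plausible reading of selective filtering, not necessarily the authors' procedure.

```python
from statistics import mean


def linear_fit(x, y):
    """Least-squares fit y = slope * x + intercept; also returns R^2."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical student-item records:
# (problem type, human points, AI points, maximum points).
RECORDS = [
    ("reaction", 2.0, 2.0, 2.0),
    ("short_answer", 0.5, 0.0, 1.0),
    ("long_answer", 1.0, 1.0, 2.0),
    ("numerical", 2.0, 1.0, 2.0),
    ("graphing", 0.5, 1.5, 2.0),
    ("reaction", 0.0, 0.0, 2.0),
]

# Assumed filter settings, chosen for illustration only.
HUMAN_ONLY_TYPES = {"numerical", "graphing"}
LOW, HIGH = 0.1, 0.9  # AI score fractions treated as clearly wrong / clearly right


def accept_ai_score(problem_type, ai_points, max_points):
    """Return True if the AI score is kept, False if the item goes to a human."""
    if problem_type in HUMAN_ONLY_TYPES:
        return False
    fraction = ai_points / max_points
    return fraction <= LOW or fraction >= HIGH


kept = [r for r in RECORDS if accept_ai_score(r[0], r[2], r[3])]
routed = [r for r in RECORDS if not accept_ai_score(r[0], r[2], r[3])]

# Regress AI points (y) on human points (x), before and after filtering.
fit_all = linear_fit([r[1] for r in RECORDS], [r[2] for r in RECORDS])
fit_kept = linear_fit([r[1] for r in kept], [r[2] for r in kept])

print(f"kept {len(kept)} of {len(RECORDS)} items for AI grading")
print("all items:  slope=%.2f intercept=%.2f R^2=%.2f" % fit_all)
print("kept items: slope=%.2f intercept=%.2f R^2=%.2f" % fit_kept)
```

With real data, the items in `routed` would form the human grading queue, and a slope near 1 with an intercept near 0 on the kept items would indicate that the AI neither inflates nor deflates scores relative to the human graders.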

Paper Structure

This paper contains 35 sections, 5 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Examples of student work. These would be uploaded as images to the model. The exam was given in German; the problems translate as follows: Consider the pink-colored complex ion $[\hbox{Ni}(\hbox{phen})_3]^{2+}$. The phen-ligand has the following structure: … Complete the following table: oxidation number of the metal, number of d-electrons of the metal ion, coordination number of the metal ion, number of ligands. Sketch the distribution of the d-electrons of the complex ion $[\hbox{Ni}(\hbox{phen})_3]^{2+}$ in the crystal field of its ligands. Determine the magnetic properties of $[\hbox{Ni}(\hbox{phen})_3]^{2+}$. Is the complex chiral? Provide the reasoning for your answer with an appropriate sketch. In the student work on the right, the initial image of the structure is labelled "2 free EP $\to$ bidentate ligand." The student's reasoning in the left panel is, "no, the two mirror images can be superimposed by rotation, even when the double bonds are taken into account;" the reasoning on the right is, after labelling the two sketches "mirror images," "impossible to superimpose! $\Rightarrow$ it is chiral!"
  • Figure 2: The rubric page associated with the student work in Fig. 1. The instructions are: +0.25 points for each answer identical to the table below. +1 point for the following sketch (the number, distribution, and orientation of the arrows are most important); if the sketch is incorrect: +0.5 points for … (repeated for both configurations). +0.5 points for paramagnetic. +0.5 points for chiral; +0.5 points for the following sketch; +0.5 points for a mirror image of the above sketch.
  • Figure 3: Examples of the problem types on the exam. For the drawing problem: draw the Lewis formula of the following particles with correct geometry (approximately correct bond angles and spatial structure including all free electron pairs and formal charges), and provide the structure type and molecular structure. For the graphing problem: draw the concentration of the anion $\hbox{SaI}^{2-}$ into the above diagram; the student wrote, "black line with slope +2, +1, and 0." For the long answer: below you see the crystal structure of sodium hydride (light spheres: $\hbox{H}^-$, dark spheres: $\hbox{Na}^+$). Describe the lattice arrangement with respect to the position of the individual ions; the student wrote, "lattice structure reflects a typical NaCl type, the light spheres $\hbox{H}^-$ are located in the octahedral voids. On the other hand, the dark spheres, $\hbox{Na}^+$ cations, are cubic face-centered." For the multiple-choice problem, the scenario is, "to an identical mixture like in problem part a), 100 ml of water are added. How do the parameters in the table below change? Mark the correct answer." The choices are "decreases," "stays the same," and "increases" for each of the concentration of bromoethane, the overall reaction order, the rate constant, and the reaction rate. For the numerical problem: you have a sample of the nuclide $^{203}\hbox{Hg}$ with a mass of 20 mg. After how many days is only 1 mg of the original substance left? In the student response, "tagen" means days. For the reaction problem: Permanganate ions $\hbox{MnO}_4^-$ oxidize hydrazine $\hbox{N}_2\hbox{H}_4$ to nitrogen in aqueous solution under basic conditions, forming manganese(IV) oxide. Formulate the stoichiometric reaction equation for this process. The short answer problem has already been discussed in Fig. 1. The symbolic problem reads: formulate the thermodynamic equilibrium constant $K$ for the formation of bromine chloride in accordance with the standard conventions.
  • Figure 4: The prompt and JSON messages used for grading. The variables stu_url and rub_url would be replaced by the ephemeral URLs generated by our server.
  • Figure 5: Filtering AI-generated student-item grading results.
  • ...and 10 more figures
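
The additive rubric quoted in the Figure 2 caption can be tallied mechanically. The sketch below is an illustration only: the item labels are made up, the count of four table entries is taken from the four quantities listed in the Figure 1 caption, and the partial-credit rule for an incorrect sketch is left out because the caption elides it. The example response is hypothetical, not a graded student answer from the paper.

```python
# Rubric items transcribed from the Figure 2 caption:
# (label, points per occurrence, maximum occurrences).
RUBRIC = [
    ("table entry identical to rubric", 0.25, 4),  # four table rows, per Figure 1
    ("d-electron sketch", 1.0, 1),
    ("paramagnetic", 0.5, 1),
    ("chiral", 0.5, 1),
    ("chirality sketch", 0.5, 1),
    ("mirror image of chirality sketch", 0.5, 1),
]


def max_points(rubric=RUBRIC):
    """Total points available if every rubric item is fully awarded."""
    return sum(points * count for _, points, count in rubric)


def score(awarded, rubric=RUBRIC):
    """Sum points for awarded items; `awarded` maps label -> occurrences."""
    total = 0.0
    for label, points, count in rubric:
        # Never award an item more often than the rubric allows.
        total += points * min(awarded.get(label, 0), count)
    return total


# Hypothetical response: correct table, correct d-electron sketch,
# "paramagnetic", but no credit on the chirality part.
example = {
    "table entry identical to rubric": 4,
    "d-electron sketch": 1,
    "paramagnetic": 1,
}

print(max_points(), score(example))
```

Storing the rubric as data rather than hard-coding the sum makes it easy to compare an AI-assigned total against the maximum or against a human grader's item-level marks.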