CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Pranav Raikote, Korbinian Randl, Ioanna Miliou, Athanasios Lakes, Panagiotis Papapetrou

Abstract

Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
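The routing step described in the abstract can be illustrated with a minimal sketch. It assumes the grader exposes one logit per possible grade and that a temperature has already been fitted on held-out data; the function names, the example logits, and the threshold `tau` are hypothetical and are not taken from the paper.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, temperature, tau):
    """Return (predicted_grade, confidence, decision) for one response.

    The decision is "auto" when the calibrated confidence reaches tau,
    otherwise "human_review". Both labels are illustrative.
    """
    probs = softmax_with_temperature(logits, temperature)
    grade = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[grade]
    decision = "auto" if confidence >= tau else "human_review"
    return grade, confidence, decision

# Hypothetical logits over grades 0..5 for a single student answer.
logits = [0.0, 0.0, 0.0, 3.0, 1.0, 0.0]

# Uncalibrated (T = 1) versus softened (T = 2) confidence for the same logits.
print(route(logits, temperature=1.0, tau=0.7))
print(route(logits, temperature=2.0, tau=0.7))
```

A temperature above 1 leaves the predicted grade unchanged but lowers its confidence, so an overconfident prediction that would have been accepted automatically can fall below the threshold and be sent to a human grader instead.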

Paper Structure (36 sections, 12 equations, 11 figures, 9 tables, 1 algorithm)


Figures (11)

  • Figure 1: The CHiL(L)Grader loop over two iterations for similar responses to the same question. In Iteration I, the model predicts 3/5 with low confidence (40%), triggering teacher review; the corrected 4/5 grade is used for fine‑tuning. In Iteration II, the updated model predicts 4/5 with high confidence (80%), enabling automatic acceptance.
  • Figure 2: CHiL(L)Grader architecture. Historical exams are used to train the instruction‑tuned model. Prior‑year exams calibrate its confidence. During the current exam, low‑confidence cases are sent to human review, whose corrections, combined with replay samples, guide conservative model updates and recalibration.
  • Figure 3: Exact Match and Off-by-1 accuracy for Qwen-2.5-7B across DAMI, SciEntsBank, and EngSAF. The improvement across datasets reflects scale granularity rather than model quality: Off-by-1 on EngSAF ($G \in \{0,1,2\}$) is a much coarser tolerance than on DAMI ($G \in \{0,\ldots,10\}$).
  • Figure 4: Confusion matrices for baseline models on DAMI, SciEntsBank, and EngSAF. DAMI errors concentrate within $\pm 1$ of the diagonal; SciEntsBank shows systematic under-prediction of grade 4; EngSAF confirms the relative simplicity of 3-way grading.
  • Figure 5: Coverage-quality curves for DAMI, SciEntsBank, and EngSAF. Each point corresponds to a specific $\tau$ value; selected operating points are marked. Restricting coverage to high-confidence predictions consistently improves accepted-set QWK across all three datasets.
  • ...and 6 more figures
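
The coverage-quality trade-off summarized in the Figure 5 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: it uses the standard quadratic weighted kappa (QWK) formula, and the grades, confidences, and thresholds are made-up toy values.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_grades):
    """Standard QWK between two integer grade lists in 0..num_grades-1."""
    n = len(y_true)
    observed = [[0] * num_grades for _ in range(num_grades)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_grades))
                 for j in range(num_grades)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(num_grades):
        for j in range(num_grades):
            w = (i - j) ** 2 / (num_grades - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n
    if den == 0:
        return float("nan")  # undefined when only one grade ever occurs
    return 1.0 - num / den

def coverage_quality(y_true, y_pred, confidences, taus, num_grades):
    """Return (tau, coverage, accepted-set QWK) for each threshold tau."""
    points = []
    for tau in taus:
        kept = [k for k, c in enumerate(confidences) if c >= tau]
        if not kept:
            points.append((tau, 0.0, float("nan")))
            continue
        qwk = quadratic_weighted_kappa([y_true[k] for k in kept],
                                       [y_pred[k] for k in kept],
                                       num_grades)
        points.append((tau, len(kept) / len(y_true), qwk))
    return points

# Toy data (hypothetical): the two errors carry the lowest confidence.
y_true = [0, 1, 2, 0, 1, 2, 2, 0]
y_pred = [0, 1, 2, 0, 1, 2, 0, 2]
conf = [0.9, 0.8, 0.95, 0.85, 0.9, 0.7, 0.4, 0.3]

for tau, coverage, qwk in coverage_quality(y_true, y_pred, conf, [0.0, 0.6], 3):
    print(f"tau={tau:.1f} coverage={coverage:.2f} qwk={qwk:.3f}")
```

In this toy example, raising the threshold discards the two low-confidence errors, so coverage drops while QWK on the accepted set rises. That is the qualitative pattern the caption describes; the numbers themselves carry no empirical meaning.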

Theorems & Definitions (4)

  • Definition 1: Rubric
  • Definition 2: Overconfidence
  • Definition 3: Distribution Shift
  • Definition 4: Expected Calibration Error
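
The body of Definition 4 is not reproduced in this outline, so the sketch below assumes the commonly used equal-width binned form of Expected Calibration Error, $\mathrm{ECE} = \sum_b \frac{|B_b|}{n}\,\lvert \mathrm{acc}(B_b) - \mathrm{conf}(B_b) \rvert$. The function name, the bin count, and the example inputs are hypothetical, and the paper's exact formulation may differ.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Bins are equal-width over [0, 1]; each bin is closed on the left,
    and a confidence of exactly 1.0 is placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * num_bins), num_bins - 1)
        bins[idx].append((c, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece

# Overconfident toy case: 90% stated confidence but only half are correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, False, False])
# Calibrated toy case: 50% stated confidence and half are correct.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
print(overconfident, calibrated)
```

A model matching Definition 2 (overconfidence) shows up here as mean confidence exceeding accuracy within a bin, which is the gap that post-hoc temperature scaling is meant to shrink.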