Principled Design of Interpretable Automated Scoring for Large-Scale Educational Assessments
Yunsung Kim, Mike Hardy, Joseph Tey, Candace Thille, Chris Piech
TL;DR
The paper addresses the need for transparent automated scoring of complex student responses in large-scale assessments. It proposes four interpretability principles (FGTI: Faithful, Grounded, Traceable, and Interchangeable) and implements them in AnalyticScore, a three-phase framework that extracts explicit analytic components, featurizes each response into human-interpretable indicators, and scores it with an ordinal logistic regression model. A key contribution is the explicit, human-understandable scoring pipeline, which includes a distillation step to open-source Featurizer models and a formulaic scoring rule: the evidence value is $\eta = \sum_{i=1}^{c} w_{i, f(r, c_i)}$, and the score is determined by the decision thresholds $\theta_j$ that bracket it, $\theta_j \le \eta < \theta_{j+1}$. This enables faithful explanations and leaves room for human intervention. On the ASAP-SAS dataset, AnalyticScore outperforms several uninterpretable baselines and stays within $0.06$ QWK of the uninterpretable SOTA on average across 10 items, and its featurization behavior aligns well with that of human annotators performing the same task.
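The scoring rule above can be illustrated with a minimal Python sketch. It assumes that each analytic component $c_i$ has a lookup table of weights indexed by the feature value $f(r, c_i)$, and that the thresholds $\theta_j$ are sorted in ascending order. The function names, weights, thresholds, and feature values below are hypothetical and are not taken from the paper.

```python
from bisect import bisect_right


def evidence(weights, features):
    """Compute eta by summing, for each component i, the weight
    selected by that component's feature value f(r, c_i)."""
    return sum(w[v] for w, v in zip(weights, features))


def score(eta, thresholds):
    """Return the 0-based score j: the number of thresholds <= eta.

    With ascending thresholds this is the j for which
    theta_j <= eta < theta_{j+1} (theta_0 = -inf, theta_last = +inf).
    """
    return bisect_right(thresholds, eta)


# Hypothetical weights: one dict per analytic component,
# keyed by the feature value the featurizer assigns.
weights = [
    {0: 0.0, 1: 1.5},          # component 1: absent / present
    {0: 0.0, 1: 0.5, 2: 2.0},  # component 2: absent / partial / full
]
thresholds = [1.0, 3.0]        # hypothetical cut points between scores 0|1 and 1|2

features = [1, 2]              # hypothetical featurizer output for one response
eta = evidence(weights, features)
print(eta, score(eta, thresholds))
```

Because every term in the sum corresponds to one named component and one feature value, a reviewer can read off exactly which components moved a response across a threshold, which is the kind of inspection the explicit rule is meant to allow.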
Abstract
AI-driven automated scoring systems offer scalable and efficient means of evaluating complex student-generated responses. Yet, despite increasing demand for transparency and interpretability, the field has yet to develop a widely accepted solution for interpretable automated scoring to be used in large-scale real-world assessments. This work takes a principled approach to address this challenge. We analyze the needs and potential benefits of interpretable automated scoring for various assessment stakeholders and develop four principles of interpretability -- Faithfulness, Groundedness, Traceability, and Interchangeability (FGTI) -- targeted at those needs. To illustrate the feasibility of implementing these principles, we develop the AnalyticScore framework for short answer scoring as a baseline reference framework for future research. AnalyticScore operates by (1) extracting explicitly identifiable elements of the responses, (2) featurizing each response into human-interpretable values using LLMs, and (3) applying an intuitive ordinal logistic regression model for scoring. In terms of scoring accuracy, AnalyticScore outperforms many uninterpretable scoring methods, and is within only 0.06 QWK of the uninterpretable SOTA on average across 10 items from the ASAP-SAS dataset. By comparing against human annotators conducting the same featurization task, we further demonstrate that the featurization behavior of AnalyticScore aligns well with that of humans.
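For readers unfamiliar with the metric, the following is a minimal sketch of quadratic weighted kappa (QWK) in its standard form, which penalizes disagreements between two raters by the squared distance between their scores. It assumes integer scores in the range 0 to `num_labels - 1`, at least two labels, and at least some variation in the scores. The function name and example data are illustrative and are not taken from the paper's evaluation code.

```python
def quadratic_weighted_kappa(a, b, num_labels):
    """Quadratic weighted kappa between two equal-length lists of integer scores."""
    n = len(a)
    # Observed confusion matrix: observed[i][j] counts responses that
    # rater A scored i and rater B scored j.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for x, y in zip(a, b):
        observed[x][y] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels))
              for j in range(num_labels)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(num_labels):
        for j in range(num_labels):
            w = (i - j) ** 2 / (num_labels - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    # den is zero only if both raters give every response the same single score.
    return 1.0 - num / den


human = [0, 1, 2, 1]      # hypothetical human scores
predicted = [0, 1, 2, 1]  # hypothetical model scores
print(quadratic_weighted_kappa(human, predicted, 3))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so a gap of 0.06 QWK is measured on this scale.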
