Explainable Automatic Grading with Neural Additive Models
Aubrey Condor, Zachary Pardos
TL;DR
The paper addresses the explainability gap in automatic short answer grading (ASAG) by adopting Neural Additive Models (NAMs), which express a prediction as a sum of univariate feature functions, $g(E[y]) = \sum_{i=1}^K f_i(x_i)$, so that each feature's contribution can be inspected on its own. It uses Knowledge Integration (KI) rubrics to engineer 62 features from the semantic similarity between rubric phrases and response n-grams, computed with sentence-BERT embeddings, so that NAMs and logistic regression (LR) can be trained on the same features. Compared against DeBERTaV3-base and LR, NAMs give a transparent view of per-feature contributions and deliver competitive performance: they outperform LR on KI data and approach DeBERTa, though DeBERTa remains the strongest model overall. The findings suggest that NAMs can give educators useful, explainable scoring insights while retaining solid predictive power, with room to extend the approach to additional domains and to run deeper usability studies.
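The additive form above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes one small ReLU network per feature, a logistic link for $g$, random untrained weights, and an optional bias term that the formula as written omits (setting `bias=0.0` recovers it). The names `make_feature_net`, `feature_net`, and `nam_forward` are hypothetical.

```python
import math
import random

def make_feature_net(hidden, rng):
    """Random (untrained) weights for one univariate network f_i: R -> R."""
    return {
        "w1": [rng.gauss(0, 1) for _ in range(hidden)],
        "b1": [rng.gauss(0, 1) for _ in range(hidden)],
        "w2": [rng.gauss(0, 1) for _ in range(hidden)],
    }

def feature_net(net, x):
    """Evaluate f_i(x): one ReLU hidden layer followed by a linear readout."""
    return sum(
        w2 * max(0.0, w1 * x + b1)
        for w1, b1, w2 in zip(net["w1"], net["b1"], net["w2"])
    )

def sigmoid(z):
    """Numerically stable inverse of the logit link."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def nam_forward(nets, features, bias=0.0):
    """Return (probability, per-feature contributions) for one response."""
    # Each network sees only its own feature, so contributions are separable.
    contributions = [feature_net(net, x) for net, x in zip(nets, features)]
    return sigmoid(bias + sum(contributions)), contributions

rng = random.Random(0)
K = 4  # illustrative only; the paper engineers 62 features
nets = [make_feature_net(hidden=8, rng=rng) for _ in range(K)]
features = [0.9, 0.1, 0.5, 0.0]
prob, contributions = nam_forward(nets, features)
```

Because each `f_i` depends on a single input, changing one feature changes only its own entry in `contributions`, which is the property that makes per-feature plots interpretable.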
Abstract
The use of automatic short answer grading (ASAG) models may help alleviate the time burden of grading while encouraging educators to incorporate open-ended items in their curriculum more frequently. However, current state-of-the-art ASAG models are large neural networks (NNs) often described as "black boxes" that provide no explanation of which characteristics of an input are important for the output produced. This inexplicable nature can be frustrating to teachers and students who are trying to interpret or learn from an automatically generated grade. To create a powerful yet intelligible ASAG model, we experiment with a type of model called a Neural Additive Model (NAM), which combines the performance of a NN with the explainability of an additive model. We use the Knowledge Integration (KI) framework from the learning sciences to guide feature engineering, creating inputs that reflect whether a student includes certain ideas in their response. We hypothesize that indicating the inclusion (or exclusion) of predefined ideas as features will be sufficient for the NAM to have good predictive power and interpretability, as this is what may guide a human scorer using a KI rubric. We compare the performance of the NAM with that of another explainable model, logistic regression, trained on the same features, and with that of a non-explainable neural model, DeBERTa, which does not require feature engineering.
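As a rough illustration of inputs that "reflect whether a student includes certain ideas", the sketch below scores each rubric phrase by its best cosine similarity to any n-gram of the response. Several parts are assumptions rather than details from the paper: the `embed` function is a hashed bag-of-words placeholder standing in for sentence-BERT, the max aggregation and the n-gram range are guesses, and the rubric phrases are hypothetical.

```python
import math
import re
import zlib

def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())

def embed(text, dim=64):
    """Placeholder embedding (hashed bag-of-words); NOT sentence-BERT."""
    vec = [0.0] * dim
    for token in tokenize(text):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def ngrams(text, n_max):
    """All word n-grams of the response for n = 1..n_max."""
    tokens = tokenize(text)
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def rubric_features(response, rubric_phrases, n_max=4):
    """One feature per rubric phrase: best similarity over response n-grams."""
    gram_vecs = [embed(g) for g in ngrams(response, n_max)]
    features = []
    for phrase in rubric_phrases:
        phrase_vec = embed(phrase)
        # Assumed aggregation: keep the closest n-gram; 0.0 for empty responses.
        features.append(max((cosine(phrase_vec, g) for g in gram_vecs), default=0.0))
    return features

rubric_phrases = ["energy is transferred", "light from the sun"]  # hypothetical
response = "The plant uses light from the sun to make sugar"
features = rubric_features(response, rubric_phrases)
```

A feature vector built this way has one entry per rubric phrase, so the same inputs can be passed to either a NAM or a logistic regression, as the comparison in the abstract requires.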
