Operationalizing Automated Essay Scoring: A Human-Aware Approach
Yenisel Plasencia-Calaña
TL;DR
The paper addresses human-aware operationalization of Automated Essay Scoring (AES), arguing that accuracy alone is insufficient for real-world educational use. It compares ML-based AES (both handcrafted-feature and transformer-based models) with zero-shot and few-shot Large Language Models (LLMs) on the PERSUADE 2.0 dataset, evaluating robustness, bias, and explainability in addition to accuracy. The findings show that ML-based AES often achieves higher agreement with human raters, as measured by Quadratic Weighted Kappa, while LLMs offer richer, more accessible explanations but exhibit instability and higher resource demands; both approaches display bias across demographic attributes. The work highlights these trade-offs and stresses the need for AES systems that balance accuracy with reliability, fairness, and interpretability for trustworthy educational deployment, guiding future design choices in human-centered AES.
Abstract
This paper explores the human-centric operationalization of Automated Essay Scoring (AES) systems, addressing aspects beyond accuracy. We compare various machine learning-based approaches with Large Language Model (LLM) approaches, identifying their strengths, similarities, and differences. The study investigates bias, robustness, and explainability, dimensions we consider important for human-aware operationalization of AES systems. Our study shows that ML-based AES models outperform LLMs in accuracy but struggle with explainability, whereas LLMs provide richer explanations. We also found that both approaches struggle with bias and with robustness to edge scores. By analyzing these dimensions, the paper aims to identify challenges and trade-offs between different methods, contributing to more reliable and trustworthy AES methods.
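Agreement with human raters is reported here as Quadratic Weighted Kappa (QWK), so a small sketch of how that metric is commonly computed may help. This is the standard textbook definition, not necessarily the authors' implementation. The function name, the argument names, and the 1 to 6 score range in the usage note are assumptions made for illustration.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two lists of integer scores (illustrative sketch)."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("need at least two score levels")

    k = max_score - min_score + 1  # number of score levels
    n = len(human)                 # number of essays

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms for each rater.
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: larger disagreements cost more.
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Expected count if the two raters were independent.
            expected = hist_human[i] * hist_model[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters gave one identical score to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

For example, `quadratic_weighted_kappa([1, 2, 3, 4, 5, 6], [1, 2, 3, 4, 5, 6], 1, 6)` returns 1.0 for perfect agreement. Values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement. Because the penalty grows quadratically with the score gap, a model that is off by one point is penalized far less than one that is off by several.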
