Enhancing Marker Scoring Accuracy through Ordinal Confidence Modelling in Educational Assessments

Abhirup Chakravarty, Mark Brenchley, Trevor Breakspear, Ian Lewin, Yan Huang

TL;DR

This work tackles the ethical challenge of releasing AES scores by developing ordinal, confidence-aware scoring mechanisms. It reframes confidence estimation as ordinal classification, introducing N-ary CEFR and score-level binning, and proposes Kernel Weighted Ordinal Categorical Cross Entropy (KWOCCE) losses within a Hybrid Marking System. The approach yields substantial improvements in safe score release, achieving up to 47% release with 100% CEFR agreement and up to 99% with at least 95% agreement, outperforming unaided AES in reliability. These results demonstrate that leveraging the ordinal structure of CEFR and kernel-based penalties can significantly enhance the safety and practicality of automated language assessment deployment.

Abstract

A key ethical challenge in Automated Essay Scoring (AES) is ensuring that scores are only released when they meet high reliability standards. Confidence modelling addresses this by assigning a reliability estimate, in the form of a confidence score, to each automated score. In this study, we frame confidence estimation as a classification task: predicting whether an AES-generated score correctly places a candidate in the appropriate CEFR level. While this is a binary decision, we leverage the inherent granularity of the scoring domain in two ways. First, we reformulate the task as an n-ary classification problem using score binning. Second, we introduce a set of novel Kernel Weighted Ordinal Categorical Cross Entropy (KWOCCE) loss functions that incorporate the ordinal structure of CEFR labels. Our best-performing model achieves an F1 score of 0.97 and enables the system to release 47% of scores with 100% CEFR agreement and 99% with at least 95% CEFR agreement, compared to approximately 92% CEFR agreement from the standalone AES model, where all AM-predicted scores are released.
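The release figures quoted above pair a release rate with a CEFR agreement level. The following is a minimal sketch of how such a pair could be computed for a given confidence threshold; it is not the paper's implementation. The function name `release_stats`, the thresholding rule (release when confidence is at or above the threshold), and the toy data are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def release_stats(
    confidences: List[float],
    cefr_agrees: List[bool],
    threshold: float,
) -> Tuple[float, Optional[float]]:
    """Return (release_rate, agreement_among_released) for one threshold.

    Hypothetical helper: a score is "released" when its confidence is at or
    above `threshold`; agreement is the share of released scores whose CEFR
    level matches the reference level.
    """
    if len(confidences) != len(cefr_agrees):
        raise ValueError("inputs must have the same length")
    if not confidences:
        raise ValueError("inputs must be non-empty")

    # Keep the agreement flags of the scores that pass the threshold.
    released = [agree for conf, agree in zip(confidences, cefr_agrees)
                if conf >= threshold]

    release_rate = len(released) / len(confidences)
    if not released:
        # Nothing released, so agreement is undefined.
        return release_rate, None
    agreement = sum(released) / len(released)
    return release_rate, agreement


# Toy data, invented for illustration only (not results from the paper).
toy_confidences = [0.99, 0.95, 0.90, 0.80, 0.60, 0.40]
toy_agrees = [True, True, True, False, True, False]

for t in (0.9, 0.5, 0.0):
    rate, agreement = release_stats(toy_confidences, toy_agrees, t)
    print(f"threshold={t}: release_rate={rate:.2f}, agreement={agreement}")
```

On this toy data, lowering the threshold raises the release rate while lowering agreement. A threshold of 0.0 releases every score, which corresponds to the standalone setting in which all predicted scores are released.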

Paper Structure

This paper contains 25 sections, 11 equations, 11 tables.