Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability

Kaixun Yang, Mladen Raković, Yuyang Li, Quanlong Guan, Dragan Gašević, Guanliang Chen

TL;DR

The paper investigates how AES models balance accuracy, fairness, and generalizability across prompt-specific and cross-prompt settings using a large public dataset. It compares nine methods (five prompt-specific, four cross-prompt) with seven metrics, showing prompt-specific models generally outperform cross-prompt models in accuracy, while cross-prompt models can be fairer in some cases. Economic status emerges as a major bias driver, with prompt-specific models tending to exhibit more economic bias than cross-prompt ones, though pre-trained language models achieve strong performance in target-prompt settings. The work highlights a trade-off between accuracy and fairness and suggests that traditional, well-engineered features can offer favorable generalizability and fairness, while neural models excel in prompt-specific accuracy; limitations include reliance on a single dataset and lack of mitigation strategies for bias.

Abstract

Automated Essay Scoring (AES) is a well-established educational pursuit that employs machine learning to evaluate student-authored essays. While much effort has been made in this area, current research primarily focuses on either (i) boosting the predictive accuracy of an AES model for a specific prompt (i.e., developing prompt-specific models), which often relies heavily on labeled data from the same target prompt; or (ii) assessing the applicability of AES models developed on non-target prompts to the intended target prompt (i.e., developing AES models in a cross-prompt setting). Given the inherent bias in machine learning and its potential impact on marginalized groups, it is imperative to investigate whether such bias exists in current AES methods and, if identified, how it interacts with an AES model's accuracy and generalizability. Thus, our study aimed to uncover the intricate relationship between an AES model's accuracy, fairness, and generalizability, contributing practical insights for developing effective AES models in real-world education. To this end, we meticulously selected nine prominent AES methods and evaluated their performance using seven metrics on an open-source dataset, which contains over 25,000 essays and demographic information about students, such as gender, English language learner status, and economic status. Through extensive evaluations, we demonstrated that: (1) prompt-specific models tend to outperform their cross-prompt counterparts in terms of predictive accuracy; (2) prompt-specific models frequently exhibit greater bias with respect to students' economic status than cross-prompt models; (3) in the pursuit of generalizability, traditional machine learning models coupled with carefully engineered features hold greater potential for achieving both high accuracy and fairness than complex neural network models.

Paper Structure (13 sections, 14 tables)