Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards
Heejin Do, Sangwon Ryu, Gary Geunbae Lee
TL;DR
This work addresses multi-trait automated essay scoring (AES), where the standard evaluation metric, quadratic weighted kappa (QWK), is non-differentiable and so cannot serve directly as a training loss. It introduces Scoring-aware Multi-reward Reinforcement Learning (SaMRL), which pairs an autoregressive score-generation model with two rewards, a bi-directional QWK reward and a mean-trait squared error penalty, and updates the policy via PPO with KL regularization against a fixed anchor. Empirical results on the ASAP, ASAP++, and Feedback Prize datasets show state-of-the-art trait-wise performance across most prompts and trait sets, with ablations confirming the benefits of multi-reward optimization and dynamic weight learning. The approach is robust across varying prompt types and data sizes; the authors note limitations related to trait-prediction order and point to per-token policy updates as a possible future improvement.
Abstract
Recent advances in automated essay scoring (AES) have shifted towards evaluating multiple traits to provide enriched feedback. Like typical AES systems, multi-trait AES employs the quadratic weighted kappa (QWK) to measure agreement with human raters, aligning closely with the rating schema; however, its non-differentiable nature prevents its direct use in neural network training. In this paper, we propose Scoring-aware Multi-reward Reinforcement Learning (SaMRL), which integrates actual evaluation schemes into the training process by designing QWK-based rewards with a mean-squared error penalty for multi-trait AES. Existing reinforcement learning (RL) applications in AES are limited to classification models despite associated performance degradation, as RL requires probability distributions; instead, we adopt an autoregressive score generation framework to leverage token generation probabilities for robust multi-trait score predictions. Empirical analyses demonstrate that SaMRL facilitates model training, notably enhancing scoring of previously inferior prompts.
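
To make the metric concrete, the following is a minimal sketch of the standard QWK computation between two lists of integer scores. It is not the authors' implementation; the function name `quadratic_weighted_kappa` and its `min_score`/`max_score` parameters are illustrative assumptions. Because the inputs are discrete scores counted into a confusion matrix, no gradient flows from the metric back to a model, which is the obstacle the abstract describes.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Standard QWK between two integer score lists (illustrative helper).

    Assumes max_score > min_score and that every score lies in
    [min_score, max_score].
    """
    n = max_score - min_score + 1

    # Observed confusion matrix: rows are human scores, columns are predictions.
    observed = np.zeros((n, n), dtype=float)
    for t, p in zip(y_true, y_pred):
        observed[t - min_score, p - min_score] += 1

    # Quadratic disagreement weights: 0 on the diagonal, 1 at the far corners.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    # Expected matrix under independence, scaled to the same total count.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    denom = (weights * expected).sum()
    if denom == 0:
        # Only happens when both raters use one identical score throughout.
        return 1.0
    return 1.0 - (weights * observed).sum() / denom


if __name__ == "__main__":
    human = [1, 2, 3, 4, 4, 2]
    model = [1, 2, 3, 3, 4, 2]
    print(quadratic_weighted_kappa(human, model, min_score=1, max_score=4))
```

In an RL setting, a value like this can be computed on sampled score predictions and used as a scalar reward, since a reward does not need to be differentiable with respect to the model parameters.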
