RDBE: Reasoning Distillation-Based Evaluation Enhances Automatic Essay Scoring
Ali Ghiasvand Mohammadkhani
TL;DR
The paper addresses the lack of interpretability in automatic essay scoring (AES) by introducing Reasoning Distillation-Based Evaluation (RDBE), a two-step framework that first uses a large language model to generate rationales for scores and then distills this reasoning into a smaller backbone model. By training on reasoning-based outputs derived from the DREsS_New dataset, RDBE delivers both interpretable evaluations and improved scoring performance, achieving state-of-the-art results as measured by quadratic weighted kappa (QWK). The approach shows that incorporating reasoning and interpretation during training improves the reliability of AES while enabling efficient deployment on edge devices with compact models such as LongT5-Base. The work suggests that the method may carry over to other long-form evaluation tasks and identifies the quality of synthetically generated reasoning data as a key area for future improvement.
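Since results are reported in QWK, a minimal sketch of the metric may help make the numbers concrete. It uses the standard definition of quadratic weighted kappa and is not taken from the paper's code; the function name, the integer rating scale, and the sample ratings are assumptions made for illustration.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Standard QWK for integer ratings in [min_rating, max_rating].

    Assumption: ratings are integers. Scores on a fractional scale would
    first need to be mapped to integer category indices.
    """
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two rating categories")
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    # Observed confusion matrix: rows are true ratings, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty grows with the squared distance between ratings.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Expected count if true and predicted ratings were independent.
            expected = hist_true[i] * hist_pred[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both raters used one and the same category throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Illustrative ratings only, not data from the paper.
print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3))
print(quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3))
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate agreement worse than chance. Because the penalty is quadratic, a prediction two points away from the reference costs four times as much as a prediction one point away.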
Abstract
Recently, various encoder-only and encoder-decoder pre-trained models such as BERT and T5 have been applied to automatic essay scoring (AES) as small language models. However, existing studies have primarily treated this task as a classification problem, focusing solely on outputting scores in the target text without offering interpretations for the generated scores. Departing from these approaches, we introduce Reasoning Distillation-Based Evaluation (RDBE), which integrates interpretability to elucidate the rationale behind model scores while enhancing performance through initial reasoning. This interpretive capability is acquired during training by using reasoning generated by a large language model (LLM) to distill a small language model (SLM). Our experimental results demonstrate the efficacy of RDBE across all scoring rubrics considered in the dataset. RDBE outperforms both zero-shot LLM generation and generation from a baseline fine-tuned model, establishing itself as state-of-the-art on the corresponding dataset. This highlights both its practical interpretive output and its enhanced performance.
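To illustrate what distilling reasoning into the target text could look like, the sketch below builds a training target in which the rationale comes before the scores, then recovers the scores from generated text. The target format, the helper names `build_target` and `parse_scores`, the rubric names, and the sample rationale are all hypothetical; the abstract does not specify the exact format RDBE uses.

```python
import re


def build_target(rationale: str, scores: dict[str, float]) -> str:
    """Compose a target string: reasoning first, then per-rubric scores.

    Hypothetical format, chosen only for illustration.
    """
    score_part = ", ".join(f"{name}: {value}" for name, value in scores.items())
    return f"Reasoning: {rationale.strip()} Scores: {score_part}"


def parse_scores(generated: str) -> dict[str, float]:
    """Extract rubric scores from text that follows the format above."""
    if "Scores:" not in generated:
        return {}  # The model did not produce a score section.
    # Only look at the text after the last "Scores:" marker.
    tail = generated.rpartition("Scores:")[2]
    pairs = re.findall(r"(\w+):\s*([0-9]+(?:\.[0-9]+)?)", tail)
    return {name: float(value) for name, value in pairs}


# Illustrative rationale and rubric names, not taken from the dataset.
target = build_target(
    "Clear thesis but weak transitions.",
    {"content": 4.0, "organization": 3.5},
)
print(target)
print(parse_scores(target))
```

Under this assumed format, a baseline that outputs scores alone would correspond to a target containing only the score section, while the reasoning-based variant trains the small model to generate the rationale before committing to scores.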
