RDBE: Reasoning Distillation-Based Evaluation Enhances Automatic Essay Scoring

Ali Ghiasvand Mohammadkhani

TL;DR

The paper addresses the lack of interpretability in automatic essay scoring (AES) by introducing Reasoning Distillation-Based Evaluation (RDBE), a two-step framework that first uses a large language model to generate rationales for scores and then distills this reasoning into a smaller backbone model. By training on reasoning-based outputs derived from the DREsS_New dataset, RDBE delivers both interpretable evaluations and improved scoring performance, achieving state-of-the-art results on that dataset as measured by quadratic weighted kappa (QWK). The approach demonstrates that incorporating reasoning and interpretation during training enhances the reliability of AES while enabling efficient deployment on edge devices using compact models like LongT5-Base. The work suggests broad applicability to other long-form evaluation tasks and highlights data quality in synthetic reasoning generation as a key area for future improvement.
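Since QWK is the headline metric, a minimal sketch of how it is commonly computed may help. It is not taken from the paper: the function name `quadratic_weighted_kappa`, the integer rating scale, and the convention of returning 1.0 when both raters use a single identical rating are assumptions made for illustration. Scales with half-point steps would first need to be mapped to integer indices.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Quadratic weighted kappa between two lists of integer ratings.

    Hypothetical helper for illustration; assumes every rating lies in
    [min_rating, max_rating].
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("rating lists must be non-empty and equal in length")
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("the rating scale needs at least two levels")

    # Observed confusion matrix: rows are true ratings, columns predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            # Quadratic penalty: larger disagreements cost more.
            weight = (i - j) ** 2 / (n - 1) ** 2
            # Count expected by chance if the two raters were independent.
            expected = hist_true[i] * hist_pred[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    if denominator == 0.0:
        # Assumed convention: both raters always gave the same single rating.
        return 1.0
    return 1.0 - numerator / denominator
```

A value of 1 indicates perfect agreement with the human scores, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.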

Abstract

Recently, various encoder-only and encoder-decoder pre-trained models like BERT and T5 have been applied to automatic essay scoring (AES) as small language models. However, existing studies have primarily treated this task as akin to a classification problem, focusing solely on outputting scores in the target text without offering interpretations for the generated scores. Departing from these approaches, we introduce Reasoning Distillation-Based Evaluation (RDBE), which integrates interpretability to elucidate the rationale behind model scores while enhancing performance through initial reasoning. This interpretive capability is acquired during training by leveraging generated reasoning from a large language model (LLM) to distill a small language model (SLM). Our experimental results demonstrate the efficacy of RDBE across all scoring rubrics considered in the dataset. RDBE outperforms both zero-shot LLM generation and generation from a baseline fine-tuned model, establishing itself as state-of-the-art on the corresponding dataset. This highlights its practical interpretative output and enhanced performance.

Paper Structure (11 sections, 1 figure, 1 table)

Figures (1)

  • Figure 1: Automated Essay Scoring Framework: Step 1 uses zero-shot prompting of a large language model (LLM) to generate reasoning for each input prompt, which contains the subject, essay, scoring rubric, and score. Step 2 applies supervised fine-tuning so that the smaller model produces diverse reasoning followed by a concluding score, refining the essay evaluation process.
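
The two steps in the caption can be sketched as plain data preparation. Everything here is an assumption for illustration rather than the paper's actual implementation: the `EssayExample` fields, the prompt wording, the `Reasoning: ... / Score: ...` target layout, and the helper names are all hypothetical. The only details taken from the source are that the Step 1 prompt contains the subject, essay, rubric, and score, and that the Step 2 target contains reasoning followed by a concluding score.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class EssayExample:
    """One scored essay; field names are hypothetical."""
    subject: str
    essay: str
    rubric: str
    score: float


def build_reasoning_prompt(ex: EssayExample) -> str:
    """Step 1: zero-shot prompt asking an LLM to justify a known score."""
    return (
        f"Essay subject: {ex.subject}\n"
        f"Scoring rubric: {ex.rubric}\n"
        f"Essay: {ex.essay}\n"
        f"Assigned score: {ex.score}\n"
        "Explain why this essay received this score under the rubric."
    )


def build_training_pair(ex: EssayExample, reasoning: str) -> Tuple[str, str]:
    """Step 2: (source, target) pair for supervised fine-tuning.

    The source omits the score; the target puts the reasoning first and
    the score last, so the small model reasons before it concludes.
    """
    source = (
        f"Essay subject: {ex.subject}\n"
        f"Scoring rubric: {ex.rubric}\n"
        f"Essay: {ex.essay}"
    )
    target = f"Reasoning: {reasoning.strip()}\nScore: {ex.score}"
    return source, target


def parse_score(generated: str) -> Optional[float]:
    """Read the concluding score from generated text, or None if absent."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)\s*$", generated)
    return float(match.group(1)) if match else None
```

In this sketch the LLM's response to `build_reasoning_prompt` would be passed as `reasoning` to `build_training_pair`, and `parse_score` would recover the numeric score from the fine-tuned model's output so it can be compared against human ratings.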