Teach-to-Reason with Scoring: Self-Explainable Rationale-Driven Multi-Trait Essay Scoring

Heejin Do, Sangwon Ryu, Gary Geunbae Lee

TL;DR

This paper addresses the opacity of multi-trait automated essay scoring by introducing RaDME, a self-explainable framework that distills the reasoning of a teacher LLM into a lightweight student model. RaDME first predicts trait scores and then generates corresponding rationales, with the training driven by score-guided rationale distillation to ensure alignment between scores and explanations. It demonstrates strong scoring performance across traits and prompts on the ASAP/ASAP++ dataset, while producing high-quality, human-interpretable rationales and enabling inference without LLMs. The approach substantially enhances transparency in AES and offers a scalable path toward reliable, explainable feedback in educational settings.

Abstract

Multi-trait automated essay scoring (AES) systems provide a fine-grained evaluation of an essay's diverse aspects. While they excel in scoring, prior systems fail to explain why specific trait scores are assigned. This lack of transparency leaves instructors and learners unconvinced of the AES outputs, hindering their practical use. To address this, we propose a self-explainable Rationale-Driven Multi-trait automated Essay scoring (RaDME) framework. RaDME leverages the reasoning capabilities of large language models (LLMs) by distilling them into a smaller yet effective scorer. This more manageable student model is optimized to sequentially generate a trait score followed by the corresponding rationale, thereby inherently learning to select a more justifiable score by considering the subsequent rationale during training. Our findings indicate that while LLMs underperform in direct AES tasks, they excel in rationale generation when provided with precise numerical scores. Thus, RaDME integrates the superior reasoning capacities of LLMs into the robust scoring accuracy of an optimized smaller model. Extensive experiments demonstrate that RaDME achieves both accurate and adequate reasoning while supporting high-quality multi-trait scoring, significantly enhancing the transparency of AES.
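
The score-then-rationale ordering described above can be illustrated with a minimal sketch of how a distillation example might be assembled. This is not the authors' implementation: the `TraitAnnotation` type, both helper functions, the prompt wording, and the target format are hypothetical and serve only to show a teacher prompt that is given the numerical score, and a student target that places the score before the rationale.

```python
from dataclasses import dataclass


@dataclass
class TraitAnnotation:
    """One trait of one essay (hypothetical container, not from the paper)."""
    trait: str
    score: int
    rationale: str  # rationale written by the teacher LLM for this score


def build_teacher_prompt(essay: str, trait: str, score: int) -> str:
    # Assumed prompt wording: the teacher receives the numerical score and is
    # asked only to justify it, not to predict it.
    return (
        f"Essay:\n{essay}\n\n"
        f"The essay received a score of {score} for the trait '{trait}'.\n"
        "Explain why this score is appropriate."
    )


def build_student_target(ann: TraitAnnotation) -> str:
    # Assumed target format: the score comes first and the rationale follows,
    # so the student is trained to emit both in that order.
    return f"{ann.trait} score: {ann.score}\nRationale: {ann.rationale}"


if __name__ == "__main__":
    ann = TraitAnnotation(
        trait="Language",
        score=2,
        rationale="Frequent grammar errors obscure meaning.",
    )
    print(build_teacher_prompt("Some essay.", ann.trait, ann.score))
    print(build_student_target(ann))
```

Because the target is a single sequence, a standard sequence-to-sequence loss over it would cover both the score tokens and the rationale tokens; how the paper weights or structures that objective is not stated in this summary.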

Paper Structure

This paper contains 26 sections, 2 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Comparison of existing multi-trait scoring methods (top) and RaDME (bottom). Existing methods take features or rationales as input, which does not allow direct interpretation of the results, whereas RaDME explicitly derives scores followed by their rationales, enhancing the reliability of the outcomes.
  • Figure 2: An overview of the RaDME framework.
  • Figure 3: Evaluation of win rates for accuracy and relevance between rationales generated by the student model and those generated by the LLM on the test set.
  • Figure 4: Evaluation results with G-Eval.
  • Figure 5: Comparison of rationales generated by different models for the Language trait, in the case of a score of 2. Bolded models represent our proposed methods, while green highlights indicate well-specified phrases within the rationales.
  • ...and 6 more figures