Rationale Behind Essay Scores: Enhancing S-LLM's Multi-Trait Essay Scoring with Rationale Generated by LLMs

SeongYeub Chu, JongWoo Kim, Bryan Wong, MunYong Yi

TL;DR

This work introduces RMTS, a two-stage framework that leverages prompt-engineered LLMs to generate trait-specific, rubric-aligned rationales and a shared encoder-decoder S-LLM to produce trait scores. By embedding rationale content alongside essays, RMTS improves trait-wise reliability and interpretability in automated essay scoring, achieving consistent gains on ASAP/ASAP++ and the Feedback Prize dataset over strong baselines. The approach highlights the value of rationale-informed decoding for rubric-aligned evaluation and provides evidence of faithfulness and similarity properties of generated rationales. Practical impact includes improved multi-trait scoring accuracy and transparency, with code available for reproducibility.

Abstract

Existing automated essay scoring (AES) has solely relied on essay text without using explanatory rationales for the scores, thereby forgoing an opportunity to capture the specific aspects evaluated by rubric indicators in a fine-grained manner. This paper introduces Rationale-based Multiple Trait Scoring (RMTS), a novel approach for multi-trait essay scoring that integrates prompt-engineering-based large language models (LLMs) with a fine-tuning-based essay scoring model using a smaller large language model (S-LLM). RMTS uses an LLM-based trait-wise rationale generation system where a separate LLM agent generates trait-specific rationales based on rubric guidelines, which the scoring model uses to accurately predict multi-trait scores. Extensive experiments on benchmark datasets, including ASAP, ASAP++, and Feedback Prize, show that RMTS significantly outperforms state-of-the-art models and vanilla S-LLMs in trait-specific scoring. By assisting quantitative assessment with fine-grained qualitative rationales, RMTS enhances the trait-wise reliability, providing partial explanations about essays. The code is available at https://github.com/BBeeChu/RMTS.git.
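The trait-wise rationale generation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names `build_trait_prompt` and `generate_final_rationale`, and the `generate` callable (a stand-in for whatever LLM call is used) are all assumptions introduced here.

```python
from typing import Callable, Dict


def build_trait_prompt(essay_prompt: str, essay: str, trait: str, rubric: str) -> str:
    """Assemble the input for one trait-specific LLM agent (hypothetical wording)."""
    return (
        f"Essay prompt:\n{essay_prompt}\n\n"
        f"Essay:\n{essay}\n\n"
        f"Rubric guidelines for the trait '{trait}':\n{rubric}\n\n"
        f"Explain how well the essay meets these guidelines for '{trait}'."
    )


def generate_final_rationale(
    essay_prompt: str,
    essay: str,
    rubrics: Dict[str, str],
    generate: Callable[[str], str],
) -> str:
    """Query one agent per trait and join the rationales in trait order.

    `generate` is a placeholder for an LLM call; it takes a prompt string
    and returns the generated rationale text.
    """
    parts = []
    for trait, rubric in rubrics.items():  # dicts preserve insertion order
        rationale = generate(build_trait_prompt(essay_prompt, essay, trait, rubric))
        parts.append(f"[{trait}] {rationale}")
    return "\n".join(parts)
```

Passing the LLM call in as a plain callable keeps the sketch independent of any particular provider or client library; in practice `generate` would wrap a prompt-engineered LLM, and the combined rationale would be fed to the scoring model together with the essay.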

Paper Structure

This paper contains 41 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Unlike existing methods (A), we use multiple prompt-engineering LLMs to generate trait-specific rationales based on rubric guidelines as shown in (B), which are then combined with an S-LLM for comprehensive evaluation.
  • Figure 2: Trait-specific rationales are constructed using the essay prompt, the essay, and the rubric guidelines corresponding to each trait. To generate the final rationale for each essay, we combine the trait-specific rationales in sequence.
  • Figure 3: The final rationale generated by multiple LLM agents and the essay are fed into a shared encoder to extract their representations. These representations are then projected to a unified feature vector by a linear layer and passed through the decoder, which predicts trait-specific scores in sequence.
  • Figure 4: ROUGE scores of rationales within the same essay or between different essays across GPT and Llama.
  • Figure 5: Performance comparison of S-LLMs based on QWK scores, averaged across all prompts for each trait with regard to the ASAP/ASAP++ dataset, using either the essays or the rationales generated by GPT or Llama.
  • ...and 6 more figures
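
Figure 5 reports QWK (quadratic weighted kappa), the agreement metric commonly used for essay scoring. The sketch below shows the standard textbook definition of the metric for integer scores on a known scale; it is not taken from the paper's code, and the function name, argument names, and the handling of the degenerate case are assumptions made for illustration.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    y_true: Sequence[int], y_pred: Sequence[int], min_score: int, max_score: int
) -> float:
    """Quadratic weighted kappa between two integer score lists."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("the score scale needs at least two levels")

    # Observed confusion matrix: rows are true scores, columns are predictions.
    observed = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_score][p - min_score] += 1

    total = len(y_true)
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_true[i] * hist_pred[j] / total  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Both lists are constant and identical; treated here as perfect
        # agreement (a convention chosen for this sketch).
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

For example, `quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], 1, 3)` gives 1.0 for perfect agreement, while fully reversed predictions on the same scale give -1.0. In a multi-trait setting the metric would be computed separately per trait and then averaged across prompts, as the figure caption describes.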