Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring

Yida Cai, Kun Liang, Sanwoo Lee, Qinghan Wang, Yunfang Wu

TL;DR

This paper proposes Rank-Then-Score (RTS), a fine-tuning framework that enhances the essay scoring capabilities of large language models. RTS consistently outperforms the direct prompting method in terms of average QWK across all LLMs and datasets, and achieves the best performance on Chinese essay scoring using the HSK dataset.

Abstract

In recent years, large language models (LLMs) have achieved remarkable success across a variety of tasks. However, their potential in the domain of Automated Essay Scoring (AES) remains largely underexplored. Moreover, compared with English, methods for Chinese AES are not well developed. In this paper, we propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities. Specifically, we fine-tune the ranking model (Ranker) with feature-enriched data, and then feed the output of the ranking model, in the form of a candidate score set, together with the essay content into the scoring model (Scorer) to produce the final score. Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK across all LLMs and datasets, and achieves the best performance on Chinese essay scoring using the HSK dataset.
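
The results above are reported in QWK (quadratic weighted kappa), an agreement measure between human and predicted scores. The following is a minimal sketch of the standard definition in plain Python. The function name, argument names, and toy scores are illustrative assumptions and are not taken from the paper's code.

```python
from typing import List


def quadratic_weighted_kappa(
    human: List[int], predicted: List[int], min_score: int, max_score: int
) -> float:
    """Standard QWK between two equal-length lists of integer scores."""
    if len(human) != len(predicted) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    n_labels = max_score - min_score + 1
    n = len(human)

    # observed[i][j]: number of essays scored i by the human and j by the model.
    observed = [[0] * n_labels for _ in range(n_labels)]
    for h, p in zip(human, predicted):
        observed[h - min_score][p - min_score] += 1

    # Marginal score histograms for each rater.
    hist_h = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(n_labels)) for j in range(n_labels)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_labels):
        for j in range(n_labels):
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            expected = hist_h[i] * hist_p[j] / n
            num += weight * observed[i][j]
            den += weight * expected

    if den == 0.0:
        # Both raters gave every essay the same single score.
        return 1.0
    return 1.0 - num / den


# Toy usage: identical score lists give perfect agreement.
perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
```

The quadratic weight penalizes a prediction more heavily the further it lies from the human score, which is why QWK suits ordinal score scales.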

Paper Structure

This paper contains 21 sections, 5 equations, 16 figures, 13 tables.

Figures (16)

  • Figure 1: The overall architecture of RTS. Excluding the training process, the method is divided into the following four steps: (1) Select reference essays. (2) Use the feature extractor to identify the features of the essays, and incorporate these features into the essay content. (3) Utilize the Ranker to obtain the candidate score set of the current essay through pairwise ranking. (4) Feed the candidate score set, along with the essay, into the Scorer to generate the final score.
  • Figure 2: Instruction for fine-tuning the Ranker. Contents to be filled are highlighted in red.
  • Figure 3: The BST-like inference process.
  • Figure 4: Another scenario of the BST-like approach.
  • Figure 5: Instruction for fine-tuning the Scorer. Contents to be filled are highlighted in red.
  • ...and 11 more figures
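
The rank-then-score flow described in the Figure 1 caption can be sketched roughly as follows. This is a hedged illustration only: the page does not specify how the paper's BST-like procedure works, so the sketch substitutes an ordinary binary search over reference essays sorted by score. The callables `is_better` (standing in for the fine-tuned Ranker) and `pick_score` (standing in for the fine-tuned Scorer), as well as all other names, are hypothetical, and feature extraction (step 2) is omitted.

```python
from typing import Callable, List, Tuple

Reference = Tuple[str, int]  # (essay text, known score); hypothetical layout


def candidate_scores(
    essay: str,
    references: List[Reference],
    is_better: Callable[[str, str], bool],  # stand-in for the Ranker
    min_score: int,
    max_score: int,
) -> List[int]:
    """Narrow the score range by pairwise comparison against references.

    Assumption: plain binary search over references sorted by score,
    not necessarily the paper's exact BST-like procedure.
    """
    refs = sorted(references, key=lambda r: r[1])
    lo, hi = 0, len(refs)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_better(essay, refs[mid][0]):
            lo = mid + 1  # essay ranks above this reference
        else:
            hi = mid  # essay ranks at or below this reference
    # The essay falls between refs[lo - 1] and refs[lo] (when they exist).
    lower = refs[lo - 1][1] if lo > 0 else min_score
    upper = refs[lo][1] if lo < len(refs) else max_score
    return list(range(lower, upper + 1))


def rank_then_score(
    essay: str,
    references: List[Reference],
    is_better: Callable[[str, str], bool],
    pick_score: Callable[[str, List[int]], int],  # stand-in for the Scorer
    min_score: int,
    max_score: int,
) -> int:
    """Rank first to get a candidate score set, then score within it."""
    candidates = candidate_scores(essay, references, is_better, min_score, max_score)
    return pick_score(essay, candidates)


# Toy usage with stub models: "quality" is just essay length here.
toy_refs = [("a" * 2, 2), ("a" * 5, 5), ("a" * 8, 8)]
toy_ranker = lambda a, b: len(a) > len(b)
toy_scorer = lambda essay, cands: cands[len(cands) // 2]
toy_candidates = candidate_scores("a" * 6, toy_refs, toy_ranker, 0, 10)
toy_final = rank_then_score("a" * 6, toy_refs, toy_ranker, toy_scorer, 0, 10)
```

In a real setting both stubs would be replaced by calls to the fine-tuned models. The point of the sketch is only the division of labor: pairwise ranking narrows the score range, and the scorer chooses within it.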