Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition

Seungju Kim; Meounggun Jo

TL;DR

This work addresses the limitations of using GPT-4 alone for Automated Essay Scoring (AES) by pairing Large Language Models (LLMs) with Comparative Judgment (CJ). Using GPT-3.5 and GPT-4, the study compares rubric-based scoring (with basic and elaborated rubrics) against CJ-based scoring on ASAP essay sets 7 and 8, transforming CJ outputs into absolute scores and evaluating agreement with human raters via Quadratic Weighted Kappa (QWK). The results show that CJ-based scoring, especially when combined with elaborated rubrics and fine-grained scores (CJ_F), substantially improves agreement with human rater scores, with GPT-4 delivering the strongest gains. The findings imply that scalable AES should combine CJ with LLMs rather than rely on GPT-4 alone, and they point to practical directions for rubric design, dataset construction, and human–AI collaboration in educational assessment. Pairwise outcomes are modeled in the Bradley–Terry framework, $P(A \text{ beats } B) = \frac{e^{\lambda_a-\lambda_b}}{1+e^{\lambda_a-\lambda_b}}$, and QWK serves as the evaluation metric.
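
The two quantities named in the summary can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `bt_win_probability` and `quadratic_weighted_kappa` are hypothetical, and the QWK routine follows the standard textbook definition (quadratic disagreement weights over an integer score range), which is assumed here to match the metric the paper reports.

```python
import math


def bt_win_probability(lambda_a: float, lambda_b: float) -> float:
    """Bradley-Terry probability that essay A beats essay B.

    Equals e^(la - lb) / (1 + e^(la - lb)), written as a logistic function.
    """
    return 1.0 / (1.0 + math.exp(-(lambda_a - lambda_b)))


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Standard QWK between two lists of integer scores.

    Assumes max_score > min_score and that the scores are not all identical
    (otherwise the expected disagreement in the denominator is zero).
    """
    n = max_score - min_score + 1
    total = len(human)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, machine):
        observed[h - min_score][m - min_score] += 1

    # Marginal score histograms for each rater.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_h[i] * hist_m[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    return 1.0 - weighted_observed / weighted_expected
```

Equal strengths give a win probability of 0.5, and the probability depends only on the difference $\lambda_a-\lambda_b$. QWK is 1 for perfect agreement and falls toward negative values as the two sets of scores disagree more than chance would predict.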

Abstract

Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES), but their zero-shot and few-shot performance often falls short of state-of-the-art models and human raters. Fine-tuning LLMs for each specific task is impractical, however, because of the variety of essay prompts and rubrics used in real-world educational contexts. This study proposes a novel approach that combines LLMs with Comparative Judgment (CJ) for AES, using zero-shot prompting to have the model choose the better of two essays. We demonstrate that this CJ method surpasses traditional rubric-based scoring when LLMs are used for essay scoring.
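
Once an LLM has chosen a winner for each essay pair, the pairwise outcomes can be turned into one strength value per essay. The sketch below fits Bradley–Terry log-strengths by gradient ascent on a lightly regularised log-likelihood. It is an illustration under stated assumptions, not the paper's procedure: the function name `fit_bradley_terry`, the optimiser, the step size, and the `l2` penalty are all hypothetical choices, and the pairwise outcomes are made-up toy data. The mapping from strengths to absolute scores is not covered here.

```python
import math


def fit_bradley_terry(n_items, comparisons, steps=500, lr=0.1, l2=0.01):
    """Estimate Bradley-Terry log-strengths from (winner, loser) index pairs.

    Uses plain gradient ascent on the log-likelihood. The small L2 penalty
    (an assumption of this sketch) keeps estimates finite for essays that
    win or lose every comparison.
    """
    lam = [0.0] * n_items
    for _ in range(steps):
        # Gradient of the penalty term.
        grad = [-l2 * x for x in lam]
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(lam[winner] - lam[loser])))
            # Each outcome pulls the winner up and the loser down by 1 - p_win.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        lam = [x + lr * g for x, g in zip(lam, grad)]
    return lam


# Toy outcomes: essay 0 usually beats essay 1, essay 1 usually beats essay 2.
comparisons = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
strengths = fit_bradley_terry(3, comparisons)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
```

In this toy data, `ranking` orders the essays from strongest to weakest. Only differences between strengths matter in the Bradley–Terry model, so the estimates are centred around zero rather than tied to any score scale.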

Paper Structure (45 sections, 3 equations, 2 figures, 10 tables)

This paper contains 45 sections, 3 equations, 2 figures, 10 tables.

Introduction
Related Work
Automated Essay Scoring
Performance of LLMs in AES
Limitations of Fine-tuning-based Methods
Effects of Prompt Engineering
Rater Cognition in Essay Scoring
Rubric-based Scoring
Comparative Judgment
Research Questions
Methods
Dataset
Models
Rubric-based Scoring Strategy
Basic-type Rubric
...and 30 more sections

Figures (2)

Figure 2: Performance Improvements with CJ-based Scoring Across Models
Figure 3: Performance Improvements of CJ and CJ_F Across Rubric Types
