Is GPT-4 Alone Sufficient for Automated Essay Scoring?: A Comparative Judgment Approach Based on Rater Cognition
Seungju Kim, Meounggun Jo
TL;DR
This work examines the limits of using GPT-4 alone for Automated Essay Scoring (AES) and proposes pairing Large Language Models with Comparative Judgment (CJ). Using GPT-3.5 and GPT-4, the study compares rubric-based scoring (with basic and elaborated rubrics) against CJ-based scoring on ASAP essay sets 7 and 8. CJ outputs are transformed to absolute scores, and agreement with human raters is evaluated with Quadratic Weighted Kappa (QWK). The results show that CJ-based scoring, especially when combined with elaborated rubrics and fine-grained scores (CJ_F), substantially improves how closely LLM scores match human rater scores, with GPT-4 delivering the strongest gains. The findings imply that scalable AES should combine CJ with LLMs rather than rely on GPT-4 alone, and they highlight practical directions for rubric design, dataset construction, and human–AI collaboration in educational assessment. Pairwise outcomes are modeled in the Bradley–Terry framework, $P(A \text{ beats } B) = \frac{e^{\lambda_a-\lambda_b}}{1+e^{\lambda_a-\lambda_b}}$, where $\lambda_a$ and $\lambda_b$ are the latent quality parameters of essays $A$ and $B$.
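As a rough illustration of the two quantities named above, the following Python sketch computes the Bradley–Terry win probability and a standard QWK. It is not the authors' code. The function names, the integer score range arguments, and the convention of returning 1.0 when the expected disagreement is zero are assumptions made for this example.

```python
import math


def bt_win_probability(lambda_a: float, lambda_b: float) -> float:
    """P(A beats B) under Bradley-Terry, given latent quality parameters.

    Algebraically equal to e^(la - lb) / (1 + e^(la - lb)).
    """
    return 1.0 / (1.0 + math.exp(-(lambda_a - lambda_b)))


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic Weighted Kappa between two lists of integer scores.

    Assumes every score lies in [min_score, max_score] and both lists
    have the same, non-zero length.
    """
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("need at least two distinct score levels")
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and equal in length")

    total = len(human)
    # Observed confusion matrix: rows = human scores, columns = model scores.
    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_human[i] * hist_model[j] / total
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Assumed convention: both raters gave one identical score throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

For example, `bt_win_probability(0.0, 0.0)` gives 0.5, since two essays with equal quality parameters are equally likely to win a comparison.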
Abstract
Large Language Models (LLMs) have shown promise in Automated Essay Scoring (AES), but their zero-shot and few-shot performance often falls short of state-of-the-art models and human raters. However, fine-tuning LLMs for each specific task is impractical because of the variety of essay prompts and rubrics used in real-world educational contexts. This study proposes a novel approach that combines LLMs and Comparative Judgment (CJ) for AES, using zero-shot prompting to have the model choose between two essays. We demonstrate that the CJ method outperforms traditional rubric-based scoring when LLMs are used for essay scoring.
