LLMs Do Not Grade Essays Like Humans

Jerin George Mathew, Sumayya Taher, Anindita Kundu, Denilson Barbosa

Abstract

Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism tend to receive lower scores. These results suggest that LLM-generated scores and feedback follow coherent patterns but rely on signals that differ from those used by human raters, resulting in limited alignment with human grading practices. Nevertheless, our work shows that LLMs produce feedback that is consistent with their grading and that they can be reliably used in supporting essay scoring.

Paper Structure (39 sections, 42 figures, 2 tables)

Figures (42)

  • Figure 1: Overview of the proposed analysis framework. Essays are first evaluated by LLMs to produce both predicted scores and textual feedback. Essay-level features and feedback-derived signals are then extracted and analyzed to compare LLM scores with human ratings and to examine how feedback sentiment and essay characteristics relate to the scores assigned by the models.
  • Figure 2: Distribution of human-assigned essay scores across datasets.
  • Figure 3: Distribution of essay scores assigned by Llama and GPT models on the ASAP Task 7 dataset. Most models produce scores concentrated around the middle of the scale, while GPT-3.5 assigns comparatively lower scores with predictions concentrated toward the lower end.
  • Figure 4: Quadratic Weighted Kappa (QWK) scores between LLM-generated scores and human ratings across datasets. Lighter cells indicate greater agreement. While human raters exhibit relatively strong agreement with each other, agreement between LLM predictions and human scores remains generally lower across models and datasets.
  • Figure 5: Pearson correlation between LLM-generated scores and human ratings across datasets. Lighter cells indicate greater agreement. Although some models show moderate correlations with human scores, the overall agreement remains lower than the correlation observed between human raters.
  • ...and 37 more figures
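
The captions for Figures 4 and 5 refer to two agreement measures, Quadratic Weighted Kappa (QWK) and Pearson correlation. The listing above does not say how the authors computed them, so the following is only a minimal sketch of the standard definitions, assuming integer scores on a known closed range. The function names and the `min_score`/`max_score` parameters are hypothetical and chosen for illustration.

```python
from math import sqrt


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Standard QWK between two lists of integer scores (illustrative sketch)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("the score range must contain at least two values")

    k = max_score - min_score + 1  # number of distinct score levels
    n = len(rater_a)

    # Observed confusion matrix: observed[i][j] counts essays that
    # rater A scored i and rater B scored j (both offset by min_score).
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal histograms of each rater's scores.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty: larger disagreements cost more.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        # Both raters gave one identical score to every essay. Kappa is
        # formally undefined here; returning 1.0 is an assumption of this sketch.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    human = [1, 2, 3]
    model = [2, 3, 4]  # the same ranking, shifted up by one point
    print(quadratic_weighted_kappa(human, model, 1, 4))
    print(pearson_correlation(human, model))
```

The toy example shows why the two measures can diverge: a model that ranks essays exactly as a human does but scores every essay one point higher has a Pearson correlation of 1, while its QWK is noticeably lower because every score disagrees. Pearson correlation captures whether scores move together; QWK also penalizes differences in the absolute scores assigned.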