EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models

Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu

TL;DR

EssayJudge introduces the first multimodal, multi-granular benchmark for automated essay scoring (AES) that evaluates Multimodal Large Language Models across lexical, sentence, and discourse traits using text and image inputs. It provides a 1,054-essay multimodal dataset with ten trait rubrics and ground-truth scores via expert consensus, enabling trait-specific evaluation with a Quadratic Weighted Kappa metric. Across 18 MLLMs (open- and closed-source) and human judges, the study finds closed-source models outperform open-source ones but still lag behind human scoring, especially on discourse-level traits, highlighting the need for advances in multimodal reasoning and evaluation. The work also analyzes image modality effects, reveals model-specific trait strengths and weaknesses, and offers insights into dataset design and evaluation practices, aiming to guide future AES research toward more accurate, robust, and interpretable multimodal scoring systems.
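
As a rough illustration of the Quadratic Weighted Kappa (QWK) metric mentioned above, the following sketch computes agreement between human and model scores for a single trait. The function name, the integer 0 to 5 score range, and the sample scores are assumptions made for this example and are not taken from the paper; it shows the standard QWK definition, not the authors' evaluation code.

```python
def quadratic_weighted_kappa(human, model, min_score=0, max_score=5):
    """Standard QWK between two lists of integer scores.

    The 0-5 default range is an assumption for illustration only.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    k = max_score - min_score + 1  # number of score categories
    n = len(human)

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms of each rater's scores.
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * hist_human[i] * hist_model[j] / n

    if weighted_expected == 0:
        # Both raters used one identical category for every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores for one trait on six essays.
human_scores = [0, 1, 2, 3, 4, 5]
model_scores = [0, 1, 2, 3, 4, 5]
print(quadratic_weighted_kappa(human_scores, model_scores))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty is quadratic, a model that is off by two points is penalized four times as heavily as one that is off by one.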

Abstract

Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research.

Paper Structure

This paper contains 34 sections, 1 equation, 31 figures, 18 tables.

Figures (31)

  • Figure 1: Comparison of task settings between the previous evaluation paradigm (a) and our proposed EssayJudge benchmark (b) on the automated essay scoring task.
  • Figure 1: Comparison between previous AES benchmarks and our proposed EssayJudge. The cells highlighted in red indicate the highest number for #Topics and #Traits columns, and the unique modality for Modality column.
  • Figure 2: Roadmap illustration of EssayJudge dataset collection, construction and annotation.
  • Figure 3: Distribution of average scores across the ten traits for open-source and closed-source MLLMs.
  • Figure 3: Key statistics of EssayJudge dataset.
  • ...and 26 more figures