EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models
Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu
TL;DR
EssayJudge introduces the first multimodal, multi-granular benchmark for automated essay scoring (AES), evaluating Multimodal Large Language Models (MLLMs) on lexical-, sentence-, and discourse-level traits using text and image inputs. It provides a 1,054-essay multimodal dataset with ten trait rubrics and ground-truth scores established by expert consensus, enabling trait-specific evaluation with the Quadratic Weighted Kappa (QWK) metric. Across 18 open- and closed-source MLLMs compared against human judges, the study finds that closed-source models outperform open-source ones but still lag behind human scoring, especially on discourse-level traits, highlighting the need for advances in multimodal reasoning and evaluation. The work also analyzes the effect of the image modality, reveals model-specific trait strengths and weaknesses, and offers insights into dataset design and evaluation practices to guide future AES research toward more accurate, robust, and interpretable multimodal scoring systems.
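For readers unfamiliar with the metric, Quadratic Weighted Kappa (QWK) measures agreement between two sets of ordinal scores, penalizing each disagreement by the squared distance between the two scores and correcting for chance agreement. The following is a minimal sketch of the standard definition, not the benchmark's own evaluation code. The function name, the score range passed as arguments, and the choice to return 1.0 when both raters use a single identical score are assumptions made for illustration.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Standard QWK between two lists of integer scores (illustrative sketch).

    min_score and max_score bound the rubric scale; they are assumptions
    supplied by the caller, not values taken from the benchmark.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1  # number of score categories
    if k < 2:
        raise ValueError("need at least two score categories")
    n = len(human)

    # Observed confusion matrix: rows are human scores, columns are model scores.
    observed = [[0] * k for _ in range(k)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    # Marginal histograms for each rater.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic penalty in [0, 1]
            num += w * observed[i][j]
            den += w * hist_h[i] * hist_m[j] / n

    if den == 0:
        # Both raters used one identical score throughout (assumed convention).
        return 1.0
    return 1.0 - num / den


# Hypothetical scores for one trait on an assumed 0-5 scale.
human_scores = [0, 0, 5, 5]
model_scores = [5, 5, 0, 0]
kappa = quadratic_weighted_kappa(human_scores, model_scores, 0, 5)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. In a trait-specific setting, one such value would be computed per trait.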
Abstract
Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits like coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps in AES performance compared to human evaluation, particularly in discourse-level traits, highlighting the need for further advancements in MLLM-based AES research.
