Improve LLM-based Automatic Essay Scoring with Linguistic Features
Zhaoyi Joey Hou, Alejandro Ciuba, Xiang Lorraine Li
TL;DR
Automatic Essay Scoring (AES) systems struggle to generalize across prompts because writing tasks are context-specific. The authors propose a hybrid approach that injects linguistically motivated features into zero-shot LLM prompts and uses a dedicated parsing module to produce consistent scores, evaluating on ASAP and ELLIPSE with open (Mistral-7B-Instruct-v0.2) and closed (GPT-4) LLMs. Across in-domain and out-of-domain prompts, incorporating linguistic features improves LLM-based scoring; open-source LLMs approach GPT-4 performance in several settings, though results vary by dataset. The work contributes a practical, interpretable pipeline for cross-prompt AES and highlights both the value and the limitations of feature-informed prompting for scalable, generalizable essay scoring. It also calls for broader evaluation with more diverse LLMs and datasets to move toward robust, interpretable AES in real-world settings.
Abstract
Automatic Essay Scoring (AES) assigns scores to student essays, reducing the grading workload for instructors. Developing a scoring system capable of handling essays across diverse prompts is challenging due to the flexibility and diverse nature of the writing task. Existing methods typically fall into two categories: supervised feature-based approaches and large language model (LLM)-based methods. Supervised feature-based approaches often achieve higher performance but require resource-intensive training. In contrast, LLM-based methods are computationally efficient during inference but tend to suffer from lower performance. This paper combines these approaches by incorporating linguistic features into LLM-based scoring. Experimental results show that this hybrid method outperforms baseline models for both in-domain and out-of-domain writing prompts.
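To make the hybrid idea concrete, the following is a minimal sketch of how linguistic features might be placed into a zero-shot scoring prompt and how a numeric score might be parsed from the model's reply. It is an illustration under stated assumptions, not the authors' implementation: the four surface features, the prompt wording, the `Score: <number>` reply format, and the function names `compute_features`, `build_prompt`, and `parse_score` are all hypothetical, and the LLM call itself is left out (the reply is treated as a plain string).

```python
import re


def compute_features(essay: str) -> dict:
    """Compute a few simple surface features (hypothetical feature set)."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    n_words = len(words)
    unique_words = {w.lower() for w in words}
    return {
        "word_count": n_words,
        "sentence_count": len(sentences),
        "avg_word_length": round(sum(len(w) for w in words) / n_words, 2) if n_words else 0.0,
        "type_token_ratio": round(len(unique_words) / n_words, 2) if n_words else 0.0,
    }


def build_prompt(essay: str, features: dict, low: int, high: int) -> str:
    """Assemble a zero-shot prompt that lists the features before the essay."""
    feature_lines = "\n".join(f"- {name}: {value}" for name, value in features.items())
    return (
        f"Score the following essay on a scale from {low} to {high}.\n"
        f"Linguistic features of the essay:\n{feature_lines}\n\n"
        f"Essay:\n{essay}\n\n"
        "Answer with the line 'Score: <number>'."
    )


def parse_score(response: str, low: int, high: int):
    """Extract a score from the reply; return None if absent or out of range."""
    match = re.search(r"Score:\s*(-?\d+(?:\.\d+)?)", response, flags=re.IGNORECASE)
    if match is None:
        return None
    value = float(match.group(1))
    return value if low <= value <= high else None


if __name__ == "__main__":
    essay = "The cat sat. The dog ran!"
    prompt = build_prompt(essay, compute_features(essay), 1, 6)
    print(prompt)
    # A stand-in for an LLM reply; no model is called in this sketch.
    print(parse_score("The essay is short. Score: 2", 1, 6))
```

Returning `None` for a missing or out-of-range score lets the caller decide whether to retry or discard the reply, which is one simple way to keep the final scores consistent.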
