From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Minh Hoang Nguyen; Vu Hoang Pham; Xuan Thanh Huynh; Phuc Hong Mai; Vinh The Nguyen; Quang Nhut Huynh; Huy Tien Nguyen; Tung Le

TL;DR

This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2 writing and shows their promise for automatically grading writing tasks.

Abstract

Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task 2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods. The best configuration, which integrates k-SFT and RAG, achieves the strongest overall results, with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2 writing and shows their promise for automatically grading writing tasks. Code is publicly available at https://github.com/MinhNguyenDS/LLM_AES-EnL2

Paper Structure (40 sections, 21 equations, 10 figures, 7 tables)

Introduction
Related Work
Automated Essay Scoring (AES)
From Traditional Feature-based to Deep Learning-based AES
LLM-Based Approaches and IELTS Writing Assessment
Our methods
Problem Formulation
LLM-Based Scoring and Feedback Generation.
Approach 1: Discriminative Fine-Tunings
Encoder-based Architecture.
Training Objective.
Approach 2: In-context Learning
Study Three: Prompting-based In-context Learning
Study Four: Instruction Tuning
Approach 3: k-Instruction Tuning with Retrieval-Augmented Generation
...and 25 more sections

Figures (10)

Figure 1: A framework of LLM-based Automated Essay Scoring. Given an IELTS Writing Task 2 prompt (e.g., "Some people think computers should replace teachers") and a student essay response, the LLM acts as an examiner to produce (i) an overall band score (e.g., 6.5) and (ii) rubric-aligned feedback across Task Response, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy.
Figure 2: Overview of the four LLM-based AES paradigms evaluated in this study: (1) discriminative fine-tuning, (2) prompting-based inference, (3) instruction tuning with RAG, and (4) SFT with DPO and RAG.
Figure 3: Overview of the discriminative fine-tuning approach for IELTS essay scoring.
Figure 4: In-context learning paradigm for IELTS Writing Task 2 scoring, including prompting-based inference and instruction-tuned generative LLMs.
Figure 5: Overview of k-instruction tuning with Retrieval-Augmented Generation (RAG) for IELTS Writing Task 2 scoring. The model is instruction-tuned on criterion-specific subtasks using LoRA adapters, while external rubric descriptions and exemplar essays are retrieved to ground inference and reduce hallucination.
...and 5 more figures
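
The examiner setup in Figure 1 and the retrieval grounding in Figure 5 can be illustrated with a minimal sketch. It is not the authors' implementation: the `Exemplar` class, the `jaccard`, `retrieve`, and `build_prompt` helpers, the prompt wording, and the toy essays are all hypothetical, and token overlap stands in for whatever retriever the paper actually uses. Only exemplar essays are retrieved here; rubric descriptions are omitted for brevity.

```python
from dataclasses import dataclass

# The four IELTS Writing criteria named in the Figure 1 caption.
CRITERIA = (
    "Task Response",
    "Coherence and Cohesion",
    "Lexical Resource",
    "Grammatical Range and Accuracy",
)


@dataclass
class Exemplar:
    """Hypothetical container for a previously scored essay."""
    essay: str
    band: float  # overall band from a human examiner (toy values below)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(essay: str, pool: list[Exemplar], k: int = 2) -> list[Exemplar]:
    """Return the k exemplars most similar to the essay."""
    return sorted(pool, key=lambda ex: jaccard(essay, ex.essay), reverse=True)[:k]


def build_prompt(task: str, essay: str, pool: list[Exemplar], k: int = 2) -> str:
    """Assemble an examiner prompt grounded in the retrieved exemplars."""
    shots = "\n\n".join(
        f"Exemplar (band {ex.band}):\n{ex.essay}" for ex in retrieve(essay, pool, k)
    )
    criteria = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        "You are an IELTS Writing Task 2 examiner.\n"
        f"Task prompt: {task}\n\n"
        f"{shots}\n\n"
        f"Essay to score:\n{essay}\n\n"
        "Give an overall band score, then feedback on each criterion:\n"
        f"{criteria}"
    )


# Toy data, invented for illustration only.
pool = [
    Exemplar("computers can replace teachers in schools", 6.5),
    Exemplar("governments should invest in public transport", 7.0),
    Exemplar("teachers and computers both help students learn", 6.0),
]
essay = "computers should replace teachers"
prompt = build_prompt("Some people think computers should replace teachers", essay, pool)
```

The resulting `prompt` string would be passed to whichever LLM plays the examiner role; swapping `jaccard` for an embedding-based similarity changes only the `retrieve` step.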
