From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen, Quang Nhut Huynh, Huy Tien Nguyen, Tung Le

TL;DR

This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2 writing and shows their potential for automatically grading writing tasks.

Abstract

Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task 2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods. The best configuration, which integrates k-SFT and RAG, achieves the strongest overall results, with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2 writing and shows their potential for automatically grading writing tasks. Code is public at https://github.com/MinhNguyenDS/LLM_AES-EnL2

Paper Structure (40 sections, 21 equations, 10 figures, 7 tables)


Figures (10)

  • Figure 1: A framework of LLM-based Automated Essay Scoring. Given an IELTS Writing Task 2 prompt (e.g., "Some people think computers should replace teachers") and a student essay response, the LLM acts as an examiner to produce (i) an overall band score (e.g., 6.5) and (ii) rubric-aligned feedback across Task Response, Coherence and Cohesion, Lexical Resource, and Grammatical Range and Accuracy.
  • Figure 2: Overview of the four LLM-based AES paradigms evaluated in this study: (1) discriminative fine-tuning, (2) prompting-based inference, (3) instruction tuning with RAG, and (4) SFT with DPO and RAG.
  • Figure 3: Overview of the discriminative fine-tuning approach for IELTS essay scoring.
  • Figure 4: In-context learning paradigm for IELTS Writing Task 2 scoring, including prompting-based inference and instruction-tuned generative LLMs.
  • Figure 5: Overview of k-instruction tuning with Retrieval-Augmented Generation (RAG) for IELTS Writing Task 2 scoring. The model is instruction-tuned on criterion-specific subtasks using LoRA adapters, while external rubric descriptions and exemplar essays are retrieved to ground inference and reduce hallucination.
  • ...and 5 more figures
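
The retrieval-grounded examiner setup described in the captions for Figures 1 and 5 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: the rubric snippets in `RUBRIC` are invented placeholders, retrieval uses simple token-overlap (Jaccard) scoring rather than whatever retriever the authors use, and `retrieve` and `build_prompt` are hypothetical helper names. The sketch only shows how retrieved rubric text might be placed in the prompt ahead of the task and essay.

```python
import re

# Hypothetical rubric snippets (placeholders, not the official IELTS descriptors
# and not the retrieval corpus used in the paper).
RUBRIC = {
    "Task Response": "addresses all parts of the task with a clear position and relevant ideas",
    "Coherence and Cohesion": "organises information logically with clear progression and linking words",
    "Lexical Resource": "uses a wide range of vocabulary with precise word choice and collocation",
    "Grammatical Range and Accuracy": "uses varied sentence structures with accurate grammar and punctuation",
}


def tokens(text):
    """Lowercase a string and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, corpus, k=2):
    """Return the k (criterion, description) pairs with the highest Jaccard overlap with the query."""
    q = tokens(query)

    def score(item):
        d = tokens(item[1])
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    # sorted() is stable, so ties keep the corpus order.
    return sorted(corpus.items(), key=score, reverse=True)[:k]


def build_prompt(task, essay, corpus=RUBRIC, k=2):
    """Assemble an examiner-style prompt grounded in the retrieved rubric snippets."""
    context = "\n".join(f"- {name}: {desc}" for name, desc in retrieve(essay, corpus, k))
    return (
        "You are an IELTS Writing Task 2 examiner.\n"
        f"Relevant rubric excerpts:\n{context}\n\n"
        f"Task: {task}\n"
        f"Essay: {essay}\n"
        "Give an overall band score and feedback for each criterion."
    )


if __name__ == "__main__":
    print(build_prompt(
        "Some people think computers should replace teachers.",
        "I believe teachers give a clear position and relevant ideas to students.",
    ))
```

The resulting string would be passed to whichever LLM serves as the examiner; a stronger retriever (for example, one based on dense embeddings) could replace `retrieve` without changing the prompt layout.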