Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation

Maja Stahl, Leon Biermann, Andreas Nehring, Henning Wachsmuth

TL;DR

This paper investigates how large language models (LLMs) can be prompted to jointly perform automated essay scoring (AES) and generate constructive feedback for student essays. It analyzes zero-shot and few-shot prompting, including base and persona-based prompts, as well as Chain-of-Thought–inspired techniques, to determine their effects on AES performance and feedback helpfulness. Using the ASAP dataset, the study finds that tackling AES and feedback generation together can improve scoring accuracy, while feedback quality benefits from task-aware prompting, though scoring has a limited overall influence on the usefulness of feedback. The work offers practical guidance for designing LLM-based writing-support systems in education and highlights both the potential and the limitations of automated essay feedback in real-world settings.

Abstract

Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback ultimately remains low.

Paper Structure (26 sections, 2 figures, 13 tables)

Figures (2)

  • Figure 1: Exemplary student essay on library censorship from the ASAP dataset (Hamner et al., 2012), along with feedback and an essay score generated by one of the methods evaluated in this paper. Explicit connections of the feedback to essay parts are color-coded.
  • Figure 2: Overview of the main points of variation in our approach to predict a score and to generate feedback for a student essay: (a) Prompt pattern: Use of the base pattern or persona-specific pattern; (b) Task instruction type: Tasks to be tackled and their ordering; (c) In-context learning approach: Number of examples to learn from.
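
The three points of variation named in the Figure 2 caption can be pictured as three independent arguments to a prompt builder. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function `build_prompt`, the `Example` type, the instruction wording, and the default persona are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical instruction wording; the paper's actual prompts are not reproduced here.
TASK_INSTRUCTIONS = {
    "scoring": "Assign the essay a score.",
    "feedback": "Write feedback that helps the student improve the essay.",
}


@dataclass(frozen=True)
class Example:
    """One in-context example: an essay and the desired model response."""
    essay: str
    response: str


def build_prompt(
    essay: str,
    pattern: str = "base",                               # (a) "base" or "persona"
    tasks: Sequence[str] = ("scoring", "feedback"),      # (b) which tasks, in which order
    examples: Sequence[Example] = (),                    # (c) empty = zero-shot
    persona: str = "an experienced teacher",             # assumed persona text
) -> str:
    if pattern not in ("base", "persona"):
        raise ValueError(f"unknown prompt pattern: {pattern!r}")
    unknown = [t for t in tasks if t not in TASK_INSTRUCTIONS]
    if unknown:
        raise ValueError(f"unknown tasks: {unknown}")

    parts = []
    # (a) The persona pattern prepends a role description; the base pattern does not.
    if pattern == "persona":
        parts.append(f"You are {persona}.")
    # (b) Instructions appear in the requested order, so scoring can come
    # before or after feedback, or be left out entirely.
    parts.append(
        " ".join(f"Step {i}: {TASK_INSTRUCTIONS[t]}" for i, t in enumerate(tasks, 1))
    )
    # (c) Few-shot prompting adds solved examples before the target essay.
    for ex in examples:
        parts.append(f"Essay:\n{ex.essay}\nResponse:\n{ex.response}")
    parts.append(f"Essay:\n{essay}\nResponse:")
    return "\n\n".join(parts)
```

For instance, `build_prompt(essay, pattern="persona", tasks=("feedback", "scoring"))` would yield a zero-shot persona prompt that asks for feedback first and a score second, while passing a non-empty `examples` list turns the same configuration into a few-shot prompt.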