SEFL: A Framework for Generating Synthetic Educational Assignment Feedback with LLM Agents

Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva

TL;DR

This work introduces Synthetic Educational Feedback Loops (SEFL), a synthetic data framework designed to generate data that resembles immediate, on-demand feedback at scale without relying on extensive, real-world student assignments and teacher feedback.

Abstract

Providing high-quality feedback on student assignments is crucial for student success, but it is heavily limited by time and budgetary constraints. In this work, we introduce Synthetic Educational Feedback Loops (SEFL), a synthetic data framework designed to generate data that resembles immediate, on-demand feedback at scale without relying on extensive, real-world student assignments and teacher feedback. To obtain this type of data, two large language models (LLMs) operate in a teacher-student role to simulate assignment completion and formative feedback, generating 19.8K synthetic pairs of student work and corresponding critiques and actionable improvements from a teacher. With this data, we fine-tune smaller, more computationally efficient LLMs on these synthetic pairs, enabling them to replicate key features of high-quality, goal-oriented feedback. Through comprehensive evaluations with three LLM judges and three human experts, across a subset of 900 outputs, we demonstrate that SEFL-tuned models outperform both their untuned counterparts and an existing baseline in terms of feedback quality. The potential for societal impact is reinforced by extensive qualitative comments and ratings from human stakeholders -- both students and higher education instructors. SEFL has the potential to transform feedback processes for higher education and beyond.
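The teacher-student loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` type, the `FeedbackPair` record, the `generate_pair` function, the prompt wording, and the `echo_llm` stand-in are all hypothetical, and a real setup would replace the stand-ins with calls to actual models.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class FeedbackPair:
    """One synthetic training example: student work plus teacher feedback."""
    assignment: str
    student_work: str
    feedback: str


def generate_pair(teacher: LLM, student: LLM, source_text: str) -> FeedbackPair:
    """Run one teacher -> student -> teacher round (prompts are illustrative)."""
    # 1. The teacher turns an educational text into an assignment.
    assignment = teacher(f"Create an assignment based on this text:\n{source_text}")
    # 2. The student answers and is prompted to include explicit errors.
    student_work = student(
        f"Complete this assignment, making some explicit errors:\n{assignment}"
    )
    # 3. The teacher critiques the answer, addressing each mistake.
    feedback = teacher(
        "Give feedback addressing each mistake.\n"
        f"Assignment:\n{assignment}\nStudent work:\n{student_work}"
    )
    return FeedbackPair(assignment, student_work, feedback)


def echo_llm(tag: str) -> LLM:
    """Stand-in model: returns the tag plus the first line of the prompt."""
    return lambda prompt: f"[{tag}] {prompt.splitlines()[0]}"


pair = generate_pair(
    echo_llm("teacher"),
    echo_llm("student"),
    "Photosynthesis converts light into chemical energy.",
)
print(pair.feedback)
```

Passing the models in as plain callables keeps the loop independent of any particular agent library; the collected `FeedbackPair` records would then serve as fine-tuning data for a smaller model.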

Paper Structure

This paper contains 29 sections, 1 equation, 8 figures, 7 tables.

Figures (8)

  • Figure 1: SEFL Synthetic Data Generation Setup. We use a two-agent framework (Wu et al., 2023) with LLMs acting as a Student and a Teacher. The Teacher creates assignments from Fineweb-Edu (Lozhkov et al., 2024), a dataset curated using LLMs to judge the educational value of web pages. The Student then responds with explicit errors (introduced via prompting), and finally the Teacher addresses each mistake. This synthetic interaction data is then used to fine-tune multiple LLMs, whose performance is measured through human ratings and evaluations by LLMs-as-judges.
  • Figure 2: Win Rate Results. We show the win rate of our SEFL-tuned models. A win rate above 50% indicates that SEFL-tuned models are better at giving feedback than their vanilla counterparts; values below 50%, shown in red, indicate the opposite. We show results for 3 human annotators (H#) and 3 LLM judges: gpt-4o (J1), claude-3.5-sonnet (J2), and deepseek-v3 (J3).
  • Figure 3: Pairwise Cohen's $\kappa$. We show the pairwise Cohen's $\kappa$ between each LLM judge and annotator.
  • Figure 4: Qualitative Example of Feedback. Excerpt that shows how SEFL improves specificity and actionability. Full conversation will be added as supplementary material.
  • Figure 5: Optional Rater Comments by Category. AC = Actionability, GO = Goal-orientation, UF = User-friendliness, CO = Consistency, AY = Autonomy. Annotators were not required to leave a comment; they did so mainly when a response stood out (usually for a problem). We also show the 95% Wilson interval for the net balance; where no interval is visible, the category received zero comments. SEFL-tuned models receive more positive comments in absolute terms.
  • ...and 3 more figures
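
The statistics named in the captions above (win rate, pairwise Cohen's kappa, and the 95% Wilson interval) can be computed as in the sketch below. It uses the standard textbook formulas and made-up toy labels; the function names and the `"tuned"`/`"base"` label encoding are assumptions for illustration, and how the paper applies the Wilson interval to the net balance of comments is not specified in this excerpt.

```python
import math


def win_rate(preferences):
    """Share of pairwise comparisons in which the tuned model was preferred."""
    return sum(p == "tuned" for p in preferences) / len(preferences)


def cohens_kappa(a, b):
    """Cohen's kappa between two raters labelling the same items."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if p_e == 1.0:  # both raters always use the same single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half, centre + half)


# Toy preference labels from one human annotator and one LLM judge (invented).
human = ["tuned", "tuned", "base", "tuned"]
judge = ["tuned", "base", "base", "tuned"]

rate = win_rate(human)
kappa = cohens_kappa(human, judge)
low, high = wilson_interval(successes=3, n=4)
print(rate, kappa, (low, high))
```

With only four toy items the Wilson interval is very wide, which shows why the interval is reported alongside raw comment counts when a category receives few comments.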