Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents

Chenyu Zhang; Xiaohang Luo

TL;DR

The paper tackles the persistent challenge of providing timely, high-quality formative feedback at scale in education, especially for historically marginalized learners. It introduces a theory-grounded, role-based multi-agent pipeline in which five GPT-4o agents collaboratively produce rubric-aligned scores and bias-aware, concise feedback, with explicit fairness checks across learner proficiency bands. In a 12-session AI-literacy course, the system achieves near-expert scoring fidelity and receives favorable judgments from trained graders regarding usefulness and alignment with instructional goals, while maintaining sub-minute latency and low cost. The work demonstrates the practicality of scalable, equitable feedback and points to future extensions in multilingual contexts, broader settings, and robust safeguards for responsible AI-assisted learning.

Abstract

Formative feedback is widely recognized as one of the most effective drivers of student learning, yet it remains difficult to implement equitably at scale. In large or low-resource courses, instructors often lack the time, staffing, and bandwidth required to review and respond to every student reflection, creating gaps in support precisely where learners would benefit most. This paper presents a theory-grounded system that uses five coordinated role-based LLM agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) to score learner reflections with a shared rubric and to generate short, bias-aware, learner-facing comments. The agents first produce structured rubric scores, then check for potentially biased or exclusionary language, add metacognitive prompts that invite students to think about their own thinking, and finally compose a concise feedback message of at most 120 words. The system includes simple fairness checks that compare scoring error across lower- and higher-scoring learners, enabling instructors to monitor and bound disparities in accuracy. We evaluate the pipeline in a 12-session AI literacy program with adult learners. In this setting, the system produces rubric scores that approach expert-level agreement, and trained graders rate the AI-generated comments as helpful, empathetic, and well aligned with instructional goals. Taken together, these results show that multi-agent LLM systems can deliver equitable, high-quality formative feedback at a scale and speed that would be impossible for human graders alone. More broadly, the work points toward a future where feedback-rich learning becomes feasible for any course size or context, advancing long-standing goals of equity, access, and instructional capacity in education.
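To make the fairness check mentioned in the abstract concrete, the sketch below shows one way such a comparison might be computed. It is an illustration only, not the authors' implementation. The error metric (mean absolute error), the way learners are split into bands (a cutoff on the human score), the function names, the 1-5 rubric scale, the sample scores, and the 0.5 tolerance are all assumptions introduced here.

```python
from statistics import mean


def mean_abs_error(pairs):
    """Mean absolute difference between human and model scores."""
    return mean(abs(human - model) for human, model in pairs)


def error_gap_by_band(human_scores, model_scores, band_cutoff):
    """Return (low_band_mae, high_band_mae, gap) for two proficiency bands.

    Learners whose human score is below `band_cutoff` form the lower band;
    all others form the higher band. The cutoff is a hypothetical choice.
    """
    pairs = list(zip(human_scores, model_scores))
    low = [p for p in pairs if p[0] < band_cutoff]
    high = [p for p in pairs if p[0] >= band_cutoff]
    if not low or not high:
        raise ValueError("each band needs at least one learner")
    low_mae = mean_abs_error(low)
    high_mae = mean_abs_error(high)
    return low_mae, high_mae, abs(low_mae - high_mae)


# Made-up scores on an assumed 1-5 rubric, for illustration only.
human = [1, 2, 2, 4, 5, 5]
model = [2, 2, 3, 4, 4, 5]

low_mae, high_mae, gap = error_gap_by_band(human, model, band_cutoff=3)
within_bound = gap <= 0.5  # 0.5 is an assumed tolerance, not from the paper
```

An instructor could rerun a check of this kind after each session and flag the pipeline for review whenever the gap exceeds the chosen tolerance.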
