
REFINE: Real-world Exploration of Interactive Feedback and Student Behaviour

Fares Fawzi, Seyed Parsa Neshaei, Marta Knezevic, Tanya Nazaretsky, Tanja Käser

Abstract

Formative feedback is central to effective learning, yet providing timely, individualised feedback at scale remains a persistent challenge. While recent work has explored the use of large language models (LLMs) to automate feedback, most existing systems still conceptualise feedback as a static, one-way artifact, offering limited support for interpretation, clarification, or follow-up. In this work, we introduce REFINE, a locally deployable, multi-agent feedback system built on small, open-source LLMs that treats feedback as an interactive process. REFINE combines a pedagogically-grounded feedback generation agent with an LLM-as-a-judge-guided regeneration loop using a human-aligned judge, and a self-reflective tool-calling interactive agent that supports student follow-up questions with context-aware, actionable responses. We evaluate REFINE through controlled experiments and an authentic classroom deployment in an undergraduate computer science course. Automatic evaluations show that judge-guided regeneration significantly improves feedback quality, and that the interactive agent produces efficient, high-quality responses comparable to a state-of-the-art closed-source model. Analysis of real student interactions further reveals distinct engagement patterns and indicates that system-generated feedback systematically steers subsequent student inquiry. Our findings demonstrate the feasibility and effectiveness of multi-agent, tool-augmented feedback systems for scalable, interactive feedback.


Paper Structure

This paper contains 15 sections, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Interactive feedback workflow and system overview. (Top) Classroom workflow where students submit handwritten solutions, receive structured feedback, and ask questions. (Bottom) REFINE multi-agent system, where feedback is iteratively refined by a human-aligned LLM judge and paired with a trained tool-calling agent that answers via closed-loop, self-reflective reasoning.
  • Figure 2: (Left) Percentage of positive Feedback Judge judgments on $\mathcal{D}_{fb}$ across rubric dimensions before (solid bars) and after (hashed bars) a judge-guided feedback refinement step. (Right) Percentage of positive Feedback Judge judgments on $\mathcal{D}_{study}^{feedback}$ and $\mathcal{D}_{prep}^{feedback}$ with the feedback generation model (Qwen3-30B-Thinking) after one iteration, on Current-State and Task Next Steps correctness.
  • Figure 3: (Left) Percentage of positive judgments on dataset $\mathcal{D}_{int}^{test}$ by evaluation criterion. Qwen3-8B REFINE is shown with hashed bars. (Right) Tool-mediated interaction efficiency on $\mathcal{D}_{int}^{test}$. The extra-step rate is the proportion of queries requiring more than the nominal two-step trajectory.
  • Figure 4: (Left) Question category use. Bars show the percentage of students who asked at least one question in each category (in-class: $n{=}39$, exam-preparation: $n{=}21$). Counts above the bars show the number of students per condition. (Right) Distribution of task scores for students who asked at least one question in each category.