Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

Kathrin Seßler, Arne Bewersdorff, Claudia Nerdel, Enkelejda Kasneci

TL;DR

The study benchmarks LLM-generated feedback against that of practicing teachers and science-education experts on student experimentation protocols, using a six-dimension framework (Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, Technical Terminology). On average, LLM feedback is comparable in overall quality to human feedback, but it falls short in the Feed Back dimension, which requires identifying and explaining errors in the context of the student's work. Qualitative analysis highlights the LLM's limited contextual understanding and its difficulty communicating specific errors clearly, while length analyses suggest LLMs can provide concise, classroom-appropriate feedback. The findings argue for a teacher-in-the-loop approach that combines the efficiency of LLMs with educators' contextual insight, and point to future work involving more advanced models and multimodal feedback.

Abstract

Effective feedback is essential for fostering students' success in scientific inquiry. With advancements in artificial intelligence, large language models (LLMs) offer new possibilities for delivering instant and adaptive feedback. However, this feedback often lacks the pedagogical validation provided by real-world practitioners. To address this limitation, our study evaluates and compares the feedback quality of LLM agents with that of human teachers and science education experts on student-written experimentation protocols. Four blinded raters, all professionals in scientific inquiry and science education, evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and 3) the science education experts using a five-point Likert scale based on six criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that LLM-generated feedback does not differ significantly from that of teachers and experts in overall quality. However, the LLM agent's performance lags in the Feed Back dimension, which involves identifying and explaining errors within the student's work context. Qualitative analysis highlighted the LLM agent's limitations in contextual understanding and in the clear communication of specific errors. Our findings suggest that combining LLM-generated feedback with human expertise can enhance educational practices by leveraging the efficiency of LLMs and the nuanced understanding of educators.

Paper Structure

This paper contains 28 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Prompt used to generate feedback texts on all errors that occurred in the student experimental protocols.
  • Figure 2: Distribution of the rating scores of teachers, experts and LLM agent averaged across all six dimensions on feedback quality.
  • Figure 3: Average score and standard deviation of the scoring of the feedback texts generated by teachers, experts and LLM agent in each rating category. Significant differences are marked by $*$ ($p<0.05$) and $**$ ($p<0.01$).
  • Figure 4: Distribution of the number of words in each feedback text written by the teachers, experts and the LLM agent.
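
The per-category averages and standard deviations summarized in Figure 3 could be computed along the following lines. This is a minimal sketch, not the authors' analysis code: the `summarize` function, the tuple layout, and the source labels are hypothetical, and the scores are made-up placeholders rather than data from the study. It assumes only what the abstract states, namely ratings on a five-point Likert scale across six dimensions.

```python
from collections import defaultdict
from statistics import mean, stdev

# The six rating dimensions named in the abstract.
DIMENSIONS = [
    "Feed Up", "Feed Back", "Feed Forward",
    "Constructive Tone", "Linguistic Clarity", "Technical Terminology",
]

def summarize(ratings):
    """Return {(source, dimension): (mean, sample std)} for Likert ratings.

    `ratings` is an iterable of (source, dimension, score) tuples.
    This layout is an assumption made for illustration.
    """
    grouped = defaultdict(list)
    for source, dimension, score in ratings:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        if not 1 <= score <= 5:
            raise ValueError(f"score outside five-point scale: {score}")
        grouped[(source, dimension)].append(score)
    return {
        key: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for key, scores in grouped.items()
    }

# Placeholder scores, NOT results from the paper.
example = [
    ("llm", "Feed Back", 3), ("llm", "Feed Back", 2),
    ("llm", "Feed Back", 3), ("llm", "Feed Back", 4),
    ("teacher", "Feed Back", 4), ("teacher", "Feed Back", 4),
    ("teacher", "Feed Back", 5), ("teacher", "Feed Back", 3),
]
summary = summarize(example)
for (source, dimension), (avg, sd) in sorted(summary.items()):
    print(f"{source:8s} {dimension:10s} mean={avg:.2f} sd={sd:.2f}")
```

Significance markers such as those in Figure 3 would require a separate statistical test on these groups; the sketch deliberately stops at the descriptive statistics, since the section does not say which test was used.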