You're (Not) My Type -- Can LLMs Generate Feedback of Specific Types for Introductory Programming Tasks?
Dominic Lohr, Hieke Keuning, Natalie Kiesler
TL;DR
This work investigates how to generate specific, elaborated feedback types for introductory programming tasks using prompt engineering. Through five iterative prompt designs and GPT-4 evaluations on 11 student submissions across six exercises and four languages, the study shows that the targeted feedback type can be produced in nearly all cases (63/66), but it also reveals risks of misleading or extraneous feedback. The authors advocate a hybrid approach that combines LLM-generated feedback with automated checks and human oversight to improve accuracy and usefulness. The findings inform educators and tool developers about practical guardrails and design considerations for AI-assisted feedback in programming education, and suggest avenues for future research on multi-type feedback generation and its effects on learning.
Abstract
Background: Feedback, as one of the most influential factors for learning, has been the subject of a large body of research. It plays a key role in the development of educational technology systems and is traditionally rooted in deterministic feedback defined by experts and their experience. However, with the rise of generative AI and especially Large Language Models (LLMs), we expect feedback as part of learning systems to transform, especially in the context of programming. In the past, it was challenging to automate feedback for learners of programming. LLMs may create new possibilities to provide richer and more individualized feedback than ever before. Objectives: This paper aims to generate specific types of feedback for introductory programming tasks using LLMs. We revisit existing feedback taxonomies to capture the specifics of the generated feedback, such as randomness, uncertainty, and degrees of variation. Methods: We iteratively designed prompts for the generation of specific feedback types (as part of existing feedback taxonomies) in response to authentic student programs. We then evaluated the generated output and determined to what extent it reflected certain feedback types. Results and Conclusion: The present work provides a better understanding of different feedback dimensions and characteristics. The results have implications for future feedback research with regard to, for example, feedback effects and learners' informational needs. This work further provides a basis for the development of new tools and learning systems for novice programmers, including feedback generated by AI.
