Evaluating the Application of Large Language Models to Generate Feedback in Programming Education

Sven Jacobs; Steffen Jaschke

Evaluating the Application of Large Language Models to Generate Feedback in Programming Education

Sven Jacobs, Steffen Jaschke

TL;DR

The paper tackles scalable feedback in introductory programming by integrating GPT-4 into a web-based practice environment (Tutor Kai) to provide timely feedback without revealing solutions. It analyzes a semester-long study with 51 students, using GPT-4-0314 (temperature 0) to generate feedback and collecting student ratings to assess usefulness. The findings show that GPT-4 can identify most actual issues, but pitfalls include occasional incorrect or hallucinated suggestions and occasional code appearing in feedback. Practically, the work highlights deployment considerations in classrooms and points to future directions such as automated feedback taxonomy, framework development, and leveraging larger-context LLMs and RAG to enhance educational feedback.

Abstract

This study investigates the application of large language models, specifically GPT-4, to enhance programming education. The research outlines the design of a web application that uses GPT-4 to provide feedback on programming tasks, without giving away the solution. A web application for working on programming tasks was developed for the study and evaluated with 51 students over the course of one semester. The results show that most of the feedback generated by GPT-4 effectively addressed code errors. However, challenges with incorrect suggestions and hallucinated issues indicate the need for further improvements.

Evaluating the Application of Large Language Models to Generate Feedback in Programming Education

TL;DR

Abstract

Evaluating the Application of Large Language Models to Generate Feedback in Programming Education

Authors

TL;DR

Abstract

Table of Contents

Figures (2)