Feedback-Generation for Programming Exercises With GPT-4

Imen Azaiz, Natalie Kiesler, Sven Strickroth

TL;DR

The paper investigates whether GPT-4 Turbo can generate high-quality formative feedback for introductory programming exercises when given a task description and a student submission. Through a qualitative study of 55 authentic submissions across two Java assignments, with multiple GPT-4 Turbo runs and qualitative coding of the output, the authors characterize feedback content, structure, code representation, correctness, and stylistic guidance. They find that GPT-4 Turbo produces personalized, often detailed and structured feedback that can fix many issues and produce working solutions, but that it also exhibits inconsistencies, incomplete corrections, and occasional misclassifications, with accuracy generally higher for simpler tasks than for more complex ones. The work highlights both the potential and the limitations of integrating GPT-4-based feedback into e-assessment systems. It offers guidance on pedagogy, system design, privacy, and reliability, and outlines directions for future research on tailored, offline, or instructor-assisted deployments.

Abstract

Ever since Large Language Models (LLMs) and related applications became broadly available, several studies have investigated their potential for assisting educators and supporting students in higher education. LLMs such as Codex, GPT-3.5, and GPT-4 have shown promising results in the context of large programming courses, where students can benefit from feedback and hints if these are provided in a timely manner and at scale. This paper explores the quality of GPT-4 Turbo's generated output for prompts containing both the programming task specification and a student's submission as input. Two assignments from an introductory programming course were selected, and GPT-4 was asked to generate feedback for 55 randomly chosen, authentic student programming submissions. The output was qualitatively analyzed regarding correctness, personalization, fault localization, and other features identified in the material. Compared to prior work and analyses of GPT-3.5, GPT-4 Turbo shows notable improvements. For example, the output is more structured and consistent. GPT-4 Turbo can also accurately identify invalid casing in student programs' output. In some cases, the feedback also includes the output of the student program. At the same time, inconsistent feedback was noted, such as stating that the submission is correct while also saying that an error needs to be fixed. The present work increases our understanding of LLMs' potential and limitations, and of how to integrate them into e-assessment systems and pedagogical scenarios and how to instruct students who are using applications based on GPT-4.

Paper Structure (14 sections, 4 tables)