Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving

Priscylla Silva, Evandro Costa

TL;DR

This paper investigates whether large language models can reliably generate feedback for programming problem solving. It compares GPT-4o, GPT-4o-mini, GPT-4-Turbo, and Gemini-1.5-pro on a benchmark of 45 Python solutions, using a common prompt and zero temperature. The study finds that 63% of feedback hints were accurate and complete, while 37% contained errors or hallucinations, revealing both potential and limitations of current LLMs for educational feedback. The authors release the benchmark dataset and materials to support reproducibility and future improvements in automated feedback systems for programming education.

Abstract

Providing effective feedback is important for student learning in programming problem-solving. Large Language Models (LLMs) have emerged as potential tools to automate feedback generation. However, their reliability and their ability to identify reasoning errors in student code remain poorly understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o-mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed the models' capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight the potential and limitations of LLMs in programming education and underscore the need for improvements to enhance reliability and minimize risks in educational applications.

Paper Structure

This paper contains 8 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Prompt used to ask for feedback.
  • Figure 2: Confusion matrices for GPT-4o, GPT-4o-mini, and GPT-4-Turbo, showing the models' performance in classifying correct and incorrect student solutions.
  • Figure 3: Comparison of the frequency of feedback categories generated by each model.
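
As a rough illustration of the kind of confusion matrix named in the Figure 2 caption, the sketch below tallies how often a model's verdict on a solution (correct or incorrect) agrees with the ground truth. The function name, the dictionary keys, and the sample labels are all hypothetical and chosen for illustration only; they are not taken from the paper or its dataset.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Tally (actual, predicted) pairs for a binary correct/incorrect task.

    True means "solution is correct", False means "solution is incorrect".
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter(zip(actual, predicted))
    return {
        "true_correct": counts[(True, True)],      # correct, judged correct
        "false_incorrect": counts[(True, False)],  # correct, judged incorrect
        "false_correct": counts[(False, True)],    # incorrect, judged correct
        "true_incorrect": counts[(False, False)],  # incorrect, judged incorrect
    }


# Made-up labels for illustration only; not data from the paper.
actual = [True, True, False, False, False]
predicted = [True, False, False, True, False]

matrix = confusion_matrix(actual, predicted)
# Accuracy is the share of solutions whose verdict matches the ground truth.
accuracy = (matrix["true_correct"] + matrix["true_incorrect"]) / len(actual)
print(matrix, accuracy)
```

Keeping the four cells separate, rather than reporting accuracy alone, shows whether a model tends to flag working solutions as broken or to miss real errors, two failure modes with different consequences for learners.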