Automating Human Tutor-Style Programming Feedback: Leveraging GPT-4 Tutor Model for Hint Generation and GPT-3.5 Student Model for Hint Validation

Tung Phung, Victor-Alexandru Pădurean, Anjali Singh, Christopher Brooks, José Cambronero, Sumit Gulwani, Adish Singla, Gustavo Soares

TL;DR

This paper investigates how generative AI models can provide human tutor-style programming hints that help students resolve errors in their buggy programs. It develops a novel technique, GPT4Hints-GPT3.5Val, which generates hints with GPT-4 and automatically validates their quality by using GPT-3.5 to simulate the potential utility of providing this feedback.

Abstract

Generative AI and large language models hold great promise in enhancing programming education by automatically generating individualized feedback for students. We investigate the role of generative AI models in providing human tutor-style programming hints to help students resolve errors in their buggy programs. Recent works have benchmarked state-of-the-art models for various feedback generation scenarios; however, the overall quality of the generated feedback is still inferior to that of human tutors, and the models are not yet ready for real-world deployment. In this paper, we seek to push the limits of generative AI models toward providing high-quality programming hints and develop a novel technique, GPT4Hints-GPT3.5Val. As a first step, our technique leverages GPT-4 as a "tutor" model to generate hints; it boosts the generative quality by using symbolic information of failing test cases and fixes in prompts. As a next step, our technique leverages GPT-3.5, a weaker model, as a "student" model to further validate the hint quality; it performs an automatic quality validation by simulating the potential utility of providing this feedback. We show the efficacy of our technique via extensive evaluation using three real-world datasets of Python programs covering a variety of concepts ranging from basic algorithms to regular expressions and data analysis using the pandas library.
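
The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `tutor`, `student_fix`, and `passes_tests` are hypothetical callables standing in for the GPT-4 tutor, the GPT-3.5 student, and a test harness, and the acceptance rule (the `trials` count, the `threshold`, and the comparison against unaided attempts) is an illustrative guess at what "simulating the potential utility" of feedback could look like.

```python
from typing import Callable, NamedTuple, Optional


class TutorOutput(NamedTuple):
    """Artifacts the tutor model produces for one buggy program."""
    fixed_program: str  # repaired program generated as an intermediate step
    failing_test: str   # test case on which the buggy program fails
    explanation: str    # detailed explanation, used in the validation stage
    hint: str           # single-sentence hint intended for the student


# Hypothetical signatures (assumptions, not an API from the paper).
Tutor = Callable[[str], TutorOutput]
StudentFix = Callable[[str, Optional[str]], str]
PassesTests = Callable[[str], bool]


def validate_hint(
    buggy_program: str,
    explanation: str,
    student_fix: StudentFix,
    passes_tests: PassesTests,
    trials: int = 5,
    threshold: float = 0.5,
) -> bool:
    """Accept feedback if the simulated student benefits from it.

    Assumed rule: the student repairs the program in at least `threshold`
    of the trials when given the explanation, and does no worse than
    when it attempts the repair without any help.
    """
    with_help = sum(
        passes_tests(student_fix(buggy_program, explanation))
        for _ in range(trials)
    )
    without_help = sum(
        passes_tests(student_fix(buggy_program, None))
        for _ in range(trials)
    )
    return with_help / trials >= threshold and with_help >= without_help


def generate_validated_hint(
    buggy_program: str,
    tutor: Tutor,
    student_fix: StudentFix,
    passes_tests: PassesTests,
) -> Optional[str]:
    """Return the tutor's hint only if it passes validation, else None."""
    output = tutor(buggy_program)
    if validate_hint(buggy_program, output.explanation, student_fix, passes_tests):
        return output.hint
    return None  # withhold feedback that the student model could not use
```

Returning `None` when validation fails reflects the idea that low-quality feedback is better withheld than shown to a student; how the real technique trades off precision against coverage is not specified in the abstract.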

Paper Structure

This paper contains 14 sections, 10 figures.

Figures (10)

  • Figure 1: Illustrative example showcasing GPT4Hints-GPT3.5Val for the Palindrome problem shown in (a) from the BasicAlgo dataset. (b) shows a real-world buggy program. (c) shows a fixed program generated by the technique in an intermediate step, and (d) shows a test case where the buggy program fails to produce the correct output. (e) shows a detailed explanation generated by the technique that is used later in the validation stage. (f) shows the generated feedback (a single-sentence hint). (g) highlights that the validation stage of the technique successfully accepted the generated feedback as high-quality and suitable for sharing with the student.
  • Figure 2: Similar to Figure 1, this example showcases GPT4Hints-GPT3.5Val on a buggy program from the DataAnalysis dataset.
  • Figure 3: Illustration of different stages in GPT4Hints-GPT3.5Val's feedback generation process.
  • Figure 4: Prompts employed by GPT4Hints-GPT3.5Val for feedback generation (first) and feedback validation (second and third).
  • Figure 5: Overview of the datasets used in this work. See the section on datasets for details.
  • ...and 5 more figures