Prompt-Based Cost-Effective Evaluation and Operation of ChatGPT as a Computer Programming Teaching Assistant

Marc Ballestero-Ribó, Daniel Ortiz-Martínez

TL;DR

This paper evaluates ChatGPT as a teaching assistant for introductory programming, comparing GPT-3.5T and GPT-4T on five Python problems. It introduces an in-context learning prompting strategy with a chain-of-thought structure that produces feedback amenable to automatic analysis, which enables automated evaluation and a lower bound on the fraction of erroneous feedback. The findings show that GPT-4T outperforms GPT-3.5T but can still produce incorrect or irrelevant information, which raises safety concerns for real-world deployment. The work also outlines a practical path toward operating LLM-based programming tutors, including automated feedback handling, structured responses, and avenues for quality estimation and scalable evaluation in educational settings.

Abstract

The dream of achieving a student-teacher ratio of 1:1 is closer than ever thanks to the emergence of large language models (LLMs). One potential application of these models in education is providing feedback to students in university introductory programming courses, so that a student struggling to solve a basic implementation problem can seek help from an LLM available 24/7. This article studies three aspects of such an application. First, it evaluates the performance of two well-known models, GPT-3.5T and GPT-4T, in providing feedback to students. The empirical results show that GPT-4T performs much better than GPT-3.5T; however, it is not yet ready for use in a real-world scenario, because it can generate incorrect information that potential users may not always be able to detect. Second, the article proposes a carefully designed prompt, based on in-context learning techniques, that allows important parts of the evaluation process to be automated and provides a lower bound for the fraction of feedback messages containing incorrect information, saving time and effort. This is possible because the resulting feedback has a programmatically analyzable structure that incorporates diagnostic information about the LLM's performance in solving the requested task. Third, the article suggests a possible strategy for implementing a practical learning tool based on LLMs, which is rooted in the proposed prompting techniques. This strategy opens up a whole range of interesting possibilities from a pedagogical perspective.
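As a rough illustration of how a programmatically analyzable feedback structure can yield a lower bound on erroneous feedback, consider the minimal Python sketch below. It is not the authors' implementation. The `VERDICT:` line format, the `Feedback` record, and the use of a test suite as ground truth are all assumptions made for this example. The idea is that any feedback whose machine-readable verdict contradicts the known ground truth is certainly wrong, while feedback with a matching verdict may still contain errors in its explanation, so the mismatch rate can only underestimate the true error rate.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """One LLM response paired with ground truth (hypothetical record layout)."""
    program_id: str
    llm_says_correct: bool  # verdict parsed from the structured response
    passes_tests: bool      # ground truth, e.g. from an instructor's test suite


def parse_verdict(response: str) -> bool:
    """Return True if the response declares the program correct.

    Assumes the prompt asked the model to end with a line such as
    'VERDICT: CORRECT' or 'VERDICT: INCORRECT' (an invented format).
    """
    for line in response.splitlines():
        if line.strip().upper().startswith("VERDICT:"):
            return line.split(":", 1)[1].strip().upper() == "CORRECT"
    raise ValueError("no VERDICT line found in response")


def error_lower_bound(items: list[Feedback]) -> float:
    """Fraction of feedback whose verdict contradicts the ground truth.

    This is only a lower bound: a matching verdict does not guarantee
    that the accompanying explanation is free of errors.
    """
    if not items:
        return 0.0
    wrong = sum(1 for f in items if f.llm_says_correct != f.passes_tests)
    return wrong / len(items)
```

In such a setup, only the feedback that passes this automatic check would still need manual review, which is where the savings in time and effort would come from.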

Paper Structure

This paper contains 35 sections, 1 figure, 10 tables.

Figures (1)

  • Figure 1: Prompt template for analysis of computer programming assignments.