A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course

Will Yeadon; Alex Peach; Craig P. Testrow

TL;DR

The study evaluates GPT-3.5 and GPT-4, with and without prompt engineering, in a university-level physics Python coding course using blinded assessment of AI- and student-authored submissions. It compares 50 AI-generated submissions to 50 student submissions across six submission categories, yielding 300 scoring data points from three markers. Student work averages 91.9%, exceeding the best AI category (GPT-4 with prompt engineering at 81.1%) with a highly significant difference ($p = 2.482 \times 10^{-10}$); prompt engineering significantly improves both GPT-4 ($p = 1.661 \times 10^{-4}$) and GPT-3.5 ($p = 4.967 \times 10^{-9}$). The authors show that AI-generated content often remains distinguishable by human evaluators, while also highlighting the potential for AI to augment learning via targeted prompt strategies, informing future assessment design and human–AI collaboration in coding education.

Abstract

This study evaluates the performance of ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against work written solely by students and a mixed category containing both student and GPT-4 contributions in university-level physics coding assignments using the Python language. Comparing 50 student submissions to 50 AI-generated submissions across different categories, each marked blindly by three independent markers, we amassed $n = 300$ data points. Students averaged 91.9% (SE: 0.4), surpassing the highest performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference ($p = 2.482 \times 10^{-10}$). Prompt engineering significantly improved scores for both GPT-4 ($p = 1.661 \times 10^{-4}$) and GPT-3.5 ($p = 4.967 \times 10^{-9}$). Additionally, the blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They accurately identified the authorship, with 92.1% of the work categorized as 'Definitely Human' being human-authored. Simplifying this to a binary 'AI' or 'Human' categorization resulted in an average accuracy rate of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.
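
The abstract reports each category as a mean with a standard error (SE) and compares categories with a significance test. The abstract does not say which test was used, so the following is only a minimal sketch of how such figures could be computed, assuming Welch's unequal-variance t statistic as a stand-in. The function names and the score lists are hypothetical and are not data from the study.

```python
import math
import statistics


def mean_and_se(scores):
    """Return (mean, standard error of the mean) for a list of percent scores."""
    mean = statistics.fmean(scores)
    # SE = sample standard deviation / sqrt(n)
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, se


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom.

    Assumption: Welch's test is used here for illustration only; the
    abstract does not name the test behind its p-values.
    """
    mean_a, se_a = mean_and_se(a)
    mean_b, se_b = mean_and_se(b)
    t = (mean_a - mean_b) / math.sqrt(se_a**2 + se_b**2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se_a**2 + se_b**2) ** 2 / (
        se_a**4 / (len(a) - 1) + se_b**4 / (len(b) - 1)
    )
    return t, df


# Made-up percent scores for illustration, NOT data from the study.
student_scores = [92.0, 95.0, 88.0, 91.0, 94.0]
ai_scores = [80.0, 85.0, 78.0, 82.0, 80.0]

for name, scores in [("student", student_scores), ("AI", ai_scores)]:
    mean, se = mean_and_se(scores)
    print(f"{name}: mean={mean:.1f}%, SE={se:.1f}")

t_stat, dof = welch_t(student_scores, ai_scores)
print(f"Welch t={t_stat:.2f}, df={dof:.1f}")
```

Turning the t statistic into a p-value requires the t-distribution CDF, which the standard library does not provide; a statistics package would normally be used for that step.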

Paper Structure (13 sections, 3 figures, 3 tables)

Introduction
Methodology
Overview
Coding Assignment
Generating the AI Code
Results
Score comparison
Author identification
Discussion
Overview and recommendations
Limitations
Conclusion
Breakdown of marks by marker

Figures (3)

Figure 1: Percent scores for each of the six categories of submission. Student submissions score the highest, though they are closely followed by GPT-4 with prompt engineering and the Mixed student and AI work. GPT-3.5 performs strictly worse than GPT-4.
Figure 2: Histogram showing the markers' assigned authorship versus actual authorship of the 300 assessed submissions. The share of actual human-authored code is 92.1% in 'Definitely human', 73.1% in 'Probably human', 22.4% in 'Probably AI', and 8.4% in 'Definitely AI'.
Figure 3: Stacked histogram of the scores awarded by the three independent markers. Both the ANOVA and ICC models used find that the markers are consistent in their evaluations.
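
The authorship figures above (the share of human-authored work within each Likert label, and the accuracy after collapsing to a binary 'AI' or 'Human' guess) can be illustrated with a short sketch. It assumes each rating is a pair of a Likert label and a single true author, which glosses over how the Mixed category was treated. The mapping, function names, and ratings below are hypothetical and are not data from the study.

```python
from collections import Counter

# Assumed collapse of the four-point scale into a binary guess.
LIKERT_TO_BINARY = {
    "Definitely AI": "AI",
    "Probably AI": "AI",
    "Probably Human": "Human",
    "Definitely Human": "Human",
}


def binary_accuracy(ratings):
    """Fraction of ratings whose collapsed guess matches the true author."""
    correct = sum(LIKERT_TO_BINARY[label] == truth for label, truth in ratings)
    return correct / len(ratings)


def human_fraction_by_label(ratings):
    """For each Likert label, the fraction of rated work that was human-authored."""
    totals = Counter(label for label, _ in ratings)
    humans = Counter(label for label, truth in ratings if truth == "Human")
    return {label: humans[label] / totals[label] for label in totals}


# Made-up (label, true_author) pairs for illustration, NOT data from the study.
ratings = [
    ("Definitely Human", "Human"),
    ("Definitely Human", "Human"),
    ("Probably Human", "Human"),
    ("Probably Human", "AI"),
    ("Probably AI", "AI"),
    ("Probably AI", "Human"),
    ("Definitely AI", "AI"),
    ("Definitely AI", "AI"),
]

print(binary_accuracy(ratings))
print(human_fraction_by_label(ratings))
```

Reporting the per-label fraction alongside the binary accuracy shows how marker confidence tracks correctness, which a single accuracy number hides.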
