A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course
Will Yeadon, Alex Peach, Craig P. Testrow
TL;DR
The study evaluates GPT-3.5 and GPT-4, with and without prompt engineering, in a university-level physics Python coding course using blinded assessment of AI- and student-authored submissions. It compares 50 AI-generated submissions with 50 student submissions across six submission categories; with three markers scoring each submission, this yields 300 data points. Student work averaged 91.9%, exceeding the best AI category (GPT-4 with prompt engineering, 81.1%) by a statistically significant margin ($p = 2.482 \times 10^{-10}$); prompt engineering significantly improved both GPT-4 ($p = 1.661 \times 10^{-4}$) and GPT-3.5 ($p = 4.967 \times 10^{-9}$). The authors show that AI-generated content often remains distinguishable by human evaluators, while also highlighting the potential for AI to augment learning via targeted prompt strategies, informing future assessment design and human–AI collaboration in coding education.
Abstract
This study evaluates the performance of two ChatGPT variants, GPT-3.5 and GPT-4, both with and without prompt engineering, against solely student work and a mixed category containing both student and GPT-4 contributions in university-level physics coding assignments using the Python language. We compared 50 student submissions with 50 AI-generated submissions across different categories; each was marked blindly by three independent markers, giving $n = 300$ data points. Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference ($p = 2.482 \times 10^{-10}$). Prompt engineering significantly improved scores for both GPT-4 ($p = 1.661 \times 10^{-4}$) and GPT-3.5 ($p = 4.967 \times 10^{-9}$). Additionally, the blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from 'Definitely AI' to 'Definitely Human'. They identified authorship accurately: 92.1% of the work categorized as 'Definitely Human' was human-authored. Simplifying this to a binary 'AI' or 'Human' categorization resulted in an average accuracy rate of 85.3%. These findings suggest that while AI-generated work closely approaches the quality of university students' work, it often remains detectable by human evaluators.
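The two authorship figures reported above are different quantities: one is the share of work given a particular rating that was truly human-authored, the other is the accuracy after collapsing the four-point scale to a binary guess. The following is a minimal Python sketch of how each could be computed. It is an illustration under stated assumptions, not the authors' analysis code: the two middle labels ('Probably AI', 'Probably Human') are assumed names, the function names are hypothetical, and the toy ratings are invented.

```python
# Assumed mapping from the four-point scale to a binary guess.
# Only the two end labels appear in the abstract; the two middle
# labels are placeholders chosen for this illustration.
LIKERT_TO_BINARY = {
    "Definitely AI": "AI",
    "Probably AI": "AI",
    "Probably Human": "Human",
    "Definitely Human": "Human",
}


def binary_accuracy(ratings):
    """Fraction of (guess, true_author) pairs whose collapsed guess is correct."""
    if not ratings:
        raise ValueError("no ratings supplied")
    correct = sum(LIKERT_TO_BINARY[guess] == truth for guess, truth in ratings)
    return correct / len(ratings)


def share_with_author(ratings, label, author):
    """Among items rated `label`, the fraction truly written by `author`."""
    truths = [truth for guess, truth in ratings if guess == label]
    if not truths:
        raise ValueError(f"no items were rated {label!r}")
    return sum(truth == author for truth in truths) / len(truths)


# Toy data, invented for illustration only (not taken from the study).
toy_ratings = [
    ("Definitely Human", "Human"),
    ("Probably Human", "Human"),
    ("Probably Human", "AI"),
    ("Probably AI", "AI"),
    ("Definitely AI", "AI"),
    ("Probably AI", "Human"),
]

print(f"binary accuracy: {binary_accuracy(toy_ratings):.3f}")
print(
    "human share of 'Definitely Human': "
    f"{share_with_author(toy_ratings, 'Definitely Human', 'Human'):.3f}"
)
```

With real data, each submission would contribute three such pairs, one per marker, which is how 100 submissions give 300 data points.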
