Comparing Code Explanations Created by Students and Large Language Models
Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, Arto Hellas
TL;DR
We study whether code explanations generated by a large language model (GPT-3) can match or exceed those produced by students in a large first-year course. Using two lab sessions, we compare explanations for three functions across accuracy, understandability, and length, with 54 explanations per function (27 student, 27 GPT-3) rated by peers and analyzed via Mann-Whitney U tests with Bonferroni correction. Results show LLM explanations are rated higher in understandability and accuracy, while lengths are similar, suggesting LLMs can provide scalable, high-quality code-explanation exemplars for novices. The study discusses how such explanations could be integrated into introductory programming education, while noting risks of over-reliance and limitations in generalizability to more advanced content.
Abstract
Reasoning about code and explaining its purpose are fundamental skills for computer scientists. There has been extensive research in the field of computing education on the relationship between a student's ability to explain code and other skills such as writing and tracing code. In particular, the ability to describe at a high level of abstraction how code will behave over all possible inputs correlates strongly with code writing skills. However, developing the expertise to comprehend and explain code accurately and succinctly is a challenge for many students. Existing pedagogical approaches that scaffold the ability to explain code, such as producing exemplar code explanations on demand, do not currently scale well to large classrooms. The recent emergence of powerful large language models (LLMs) may offer a solution. In this paper, we explore the potential of LLMs in generating explanations that can serve as examples to scaffold students' ability to understand and explain code. To evaluate LLM-created explanations, we compare them with explanations created by students in a large course ($n \approx 1000$) with respect to accuracy, understandability and length. We find that LLM-created explanations, which can be produced automatically on demand, are rated as being significantly easier to understand and more accurate summaries of code than student-created explanations. We discuss the significance of this finding, and suggest how such models can be incorporated into introductory programming education.
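
The TL;DR mentions that ratings were analyzed via Mann-Whitney U tests with Bonferroni correction. As a rough illustration of what such an analysis might look like, the sketch below compares two groups of ordinal ratings and then adjusts the resulting p-values for multiple comparisons. It is not the authors' analysis code: the function names, the normal approximation (with tie correction and no continuity correction), the 1-5 rating scale, and the example ratings are all assumptions made purely for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all values tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p), where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    # Variance of U under the null hypothesis, corrected for ties.
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u, 1.0  # every rating is identical, so there is no evidence of a difference

    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u, p


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented 1-5 ratings for illustration only; these are NOT data from the study.
student_understandability = [3, 4, 2, 3, 4, 3, 2, 4, 3, 3]
llm_understandability = [4, 5, 4, 4, 5, 3, 4, 5, 4, 4]
student_accuracy = [3, 3, 4, 2, 4, 3, 3, 4, 2, 3]
llm_accuracy = [4, 4, 5, 4, 5, 4, 3, 5, 4, 4]

raw_p = [
    mann_whitney_u(student_understandability, llm_understandability)[1],
    mann_whitney_u(student_accuracy, llm_accuracy)[1],
]
for name, p in zip(["understandability", "accuracy"], bonferroni(raw_p)):
    print(f"{name}: Bonferroni-adjusted p = {p:.4f}")
```

A rank-based test is a natural fit for ordinal ratings because it makes no normality assumption. The Bonferroni step is deliberately conservative: with several criteria tested at once, it keeps the overall chance of a false positive at or below the chosen significance level.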
