Comparing Code Explanations Created by Students and Large Language Models

Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, Arto Hellas

TL;DR

We study whether code explanations generated by a large language model (GPT-3) can match or exceed those produced by students in a large first-year course. Using two lab sessions, we compare explanations for three functions across accuracy, understandability, and length, with 54 explanations per function (27 student, 27 GPT-3) rated by peers and analyzed via Mann-Whitney U tests with Bonferroni correction. Results show LLM explanations are rated higher in understandability and accuracy, while lengths are similar, suggesting LLMs can provide scalable, high-quality code-explanation exemplars for novices. The study discusses how such explanations could be integrated into introductory programming education, while noting risks of over-reliance and limitations in generalizability to more advanced content.

Abstract

Reasoning about code and explaining its purpose are fundamental skills for computer scientists. There has been extensive research in the field of computing education on the relationship between a student's ability to explain code and other skills such as writing and tracing code. In particular, the ability to describe at a high level of abstraction how code will behave over all possible inputs correlates strongly with code-writing skills. However, developing the expertise to comprehend and explain code accurately and succinctly is a challenge for many students. Existing pedagogical approaches that scaffold the ability to explain code, such as producing exemplar code explanations on demand, do not currently scale well to large classrooms. The recent emergence of powerful large language models (LLMs) may offer a solution. In this paper, we explore the potential of LLMs in generating explanations that can serve as examples to scaffold students' ability to understand and explain code. To evaluate LLM-created explanations, we compare them with explanations created by students in a large course ($n \approx 1000$) with respect to accuracy, understandability, and length. We find that LLM-created explanations, which can be produced automatically on demand, are rated as being significantly easier to understand and more accurate summaries of code than student-created explanations. We discuss the significance of this finding and suggest how such models can be incorporated into introductory programming education.
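
The TL;DR mentions that ratings were compared with Mann-Whitney U tests and a Bonferroni correction. As a rough illustration of that kind of analysis, the sketch below implements a two-sided Mann-Whitney U test (normal approximation with tie correction) and a Bonferroni adjustment using only the Python standard library. It is not the authors' analysis code: the function names, the choice of the normal approximation, and the rating values are all assumptions made for illustration, and the ratings are invented rather than taken from the study.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test (tie-corrected normal approximation).

    Returns (U statistic for sample a, two-sided p-value).
    """
    n1, n2 = len(a), len(b)
    n = n1 + n2
    pooled = list(a) + list(b)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        # Every observation is identical, so there is no evidence of a difference.
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u1, p


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Hypothetical 1-5 Likert ratings, invented for this example only.
    student_understand = [3, 4, 2, 3, 4, 3, 2, 4]
    llm_understand = [4, 5, 4, 4, 5, 3, 4, 5]
    student_accuracy = [3, 3, 4, 2, 4, 3, 3, 4]
    llm_accuracy = [4, 4, 5, 4, 5, 4, 3, 5]

    raw = [
        mann_whitney_u(student_understand, llm_understand)[1],
        mann_whitney_u(student_accuracy, llm_accuracy)[1],
    ]
    for name, p_raw, p_adj in zip(["understandability", "accuracy"], raw, bonferroni(raw)):
        print(f"{name}: raw p = {p_raw:.4f}, Bonferroni-adjusted p = {p_adj:.4f}")
```

A rank-based test is a natural fit here because Likert-style ratings are ordinal, and the Bonferroni step guards against false positives when several measures are tested at once. For small samples an exact test or an established statistics package would usually be preferable to the normal approximation used in this sketch.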

Paper Structure (20 sections, 3 figures, 2 tables)

This paper contains 20 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The three function definitions, as presented to students in Lab A. Students were asked to construct a short description of the intended purpose of each function.
  • Figure 2: Overview of the generation and sampling of code explanations. In Lab B, each student was allocated four code explanations to evaluate, selected at random from a pool of 54 code explanations (half of which were generated by students in Lab A, and half of which were generated by GPT-3).
  • Figure 3: Distribution of student responses on LLM and student-generated code explanations being easy to understand and accurate summaries of code.