Substance Beats Style: Why Beginning Students Fail to Code with LLMs

Francesca Lucchetti, Zixuan Wu, Arjun Guha, Molly Q Feldman, Carolyn Jane Anderson

TL;DR

We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; the information content of prompts predicts success; students get stuck making trivial edits; and more.

Abstract

Although LLMs are increasing the productivity of professional programmers, existing work shows that beginners struggle to prompt LLMs to solve text-to-code tasks. Why is this the case? This paper explores two competing hypotheses about the cause of student-LLM miscommunication: (1) students simply lack the technical vocabulary needed to write good prompts, and (2) students do not understand the extent of information that LLMs need to solve code generation tasks. We study (1) with a causal intervention experiment on technical vocabulary and (2) by analyzing graphs that abstract how students edit prompts and the different failures that they encounter. We find that substance beats style: a poor grasp of technical vocabulary is merely correlated with prompt failure; that the information content of prompts predicts success; that students get stuck making trivial edits; and more. Our findings have implications for the use of LLMs in programming education, and for efforts to make computing more accessible with LLMs.

Paper Structure

This paper contains 43 sections, 48 figures, 30 tables.

Figures (48)

  • Figure 1: An example problem that a student solves in two attempts. Given the function signature and tests, they write the first docstring. The platform prompts the model to generate the function body from the function signature and docstring (not the tests), and then tests the generated code. From the failed tests, the student realizes that the model needs to be told to round to two decimal places. They add this clue in the second prompt, which succeeds.
  • Figure 2: An example of tagging and then substituting "integer" with "whole number".
  • Figure 3: The graph of prompt trajectories for total_bill. We highlight the trajectory of S23, who ultimately fails: their first prompt (1) has most clues, but omits Clue #7 (bottom right of figure). Their next prompt (2) is a trivial change. Prompt 3 adds detail about the list structure (Clue #2), but it was already described well, so they cycle back to a previous state. Finally, prompt 4 adds the missing Clue #4 (and deletes Clue #5, which is not necessary to solve the problem). Here they give up and fail, but many others succeed from this state after adding Clue #7.
  • Figure 4: Differences between pass@1 rates before and after lexical substitutions. A negative mean difference represents a decrease in performance after substitution.
  • Figure 5: Variable/concept confusion.
  • ...and 43 more figures
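
The quantity described in the Figure 4 caption, the change in pass@1 after a lexical substitution, can be illustrated with a minimal Python sketch. It assumes pass@1 is estimated as the fraction of sampled completions that pass all tests. The function names, prompt identifiers, and pass/fail data below are hypothetical and are not taken from the paper, whose exact procedure is not given in this summary.

```python
from statistics import mean


def pass_at_1(results):
    """Estimate pass@1 as the fraction of sampled completions that pass all tests."""
    return sum(results) / len(results)


def mean_pass1_difference(before, after):
    """Mean over prompts of (pass@1 after substitution) - (pass@1 before).

    A negative value means performance dropped after the substitution.
    """
    diffs = [pass_at_1(after[p]) - pass_at_1(before[p]) for p in before]
    return mean(diffs)


# Hypothetical pass/fail outcomes for four sampled completions per prompt.
before = {
    "prompt_a": [True, True, False, True],    # pass@1 = 0.75
    "prompt_b": [False, False, True, False],  # pass@1 = 0.25
}
after = {
    "prompt_a": [True, False, False, True],   # pass@1 = 0.50
    "prompt_b": [False, False, True, False],  # pass@1 = 0.25
}

delta = mean_pass1_difference(before, after)
print(f"mean pass@1 difference: {delta:+.3f}")
```

With this made-up data the mean difference is negative, which under the caption's convention would indicate a decrease in performance after substitution.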