How Beginning Programmers and Code LLMs (Mis)read Each Other

Sydney Nguyen, Hannah McLean Babe, Yangtian Zi, Arjun Guha, Carolyn Jane Anderson, Molly Q Feldman

TL;DR

This study investigates how near-novice programmers (CS1 graduates) interact with Code LLMs in a controlled NL-to-code task. Using a large-scale, multi-institution lab design, the authors isolate prompt writing and prompt editing across 48 problems, with automatic correctness feedback to measure success. They find that beginners struggle to describe problems in natural language and to edit prompts effectively, with non-deterministic model outputs adding to the difficulty. The results highlight persistent gaps between novice mental models and Code LLM behavior, raise equity concerns for first-generation students, and argue that teaching explicit NL-to-code prompting and strong code understanding remains essential. Overall, Code LLMs are not a universal shortcut for novice programmers; careful design of tools, pedagogy, and evaluation methods is needed to unlock their educational potential.

Abstract

Generative AI models, specifically large language models (LLMs), have made strides towards the long-standing goal of text-to-code generation. This progress has invited numerous studies of user interaction. However, less is known about the struggles and strategies of non-experts, for whom each step of the text-to-code problem presents challenges: describing their intent in natural language, evaluating the correctness of generated code, and editing prompts when the generated code is incorrect. This paper presents a large-scale controlled study of how 120 beginning coders across three academic institutions approach writing and editing prompts. A novel experimental design allows us to target specific steps in the text-to-code process and reveals that beginners struggle with writing and editing prompts, even for problems at their skill level and when correctness is automatically determined. Our mixed-methods evaluation provides insight into student processes and perceptions with key implications for non-expert Code LLM use within and outside of education.

Paper Structure (82 sections, 10 figures, 13 tables)

Figures (10)

  • Figure 1: Visualization of the multi-step process of querying a large language model of code (Code LLM). The user starts by crafting their prompt in natural language (NL). They provide the prompt to the model, which produces code. The user then assesses the correctness of the generated code. If there are errors, they must identify how to resolve them and how to edit the prompt. This continues in an iterative fashion.
  • Figure 2: Study overview. (1) describes the overall student trajectory through the study. We split the post survey into two sections, divided by the semi-structured interview, to delay collecting demographic information to prevent self-bias. (2) outlines the 8 problem categories (4 timed versus 4 untimed) and the 6 problems per category. Students took individual trajectories through one problem in each category, as shown by the thin arrows. (3) showcases an example trajectory for students through the problems. Students spent, on average, 42.6 minutes (SD=10.6) completing the study, with an average of 26.6 minutes (SD=9.1) on the untimed section and 15.9 minutes (SD=3.3) on the timed section.
  • Figure 3: The Charlie the Coding Cow interface.
  • Figure 4: An overview of the experimental platform. For each problem, the frontend provides the participant with the signature and tests and asks them to write a description (prompt). This is then relayed to the backend, where the signature and prompt are sent to Codex via the API. The code completion from Codex is then run on our pre-defined tests. Finally, the results of running the tests and the code completion are presented to the participant in the frontend interface.
  • Figure 5: Basic measures of student success at the natural-language-to-code task. Success rate is the fraction of all attempts by a participant that succeed. Eventual success rate is the fraction of last attempts at a problem by a participant that succeed. Pass@1 resamples the LLM several times to estimate the probability of success. We present these measures by institution: the table of success rates and pass@1 by institution presents the means, and the histograms of success rates and eventual success rates show their distributions. Eventual success rates are higher than success rates, which is to be expected: the histogram of attempts shows that many students make several attempts at a problem before eventually succeeding or giving up.
  • ...and 5 more figures
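
The three measures defined in the Figure 5 caption can be illustrated with a minimal Python sketch. The attempt-log layout (tuples of problem name, attempt index, and pass/fail flag) and all function names below are assumptions made for illustration; they are not taken from the paper's codebase. Pass@1 is computed here as the plain fraction of resampled completions that pass, which is one common way to estimate it and may differ in detail from the authors' procedure.

```python
# Hypothetical log for ONE participant: (problem, attempt_index, passed_all_tests).
# The layout and names are illustrative assumptions, not the paper's data format.

def success_rate(attempts):
    """Fraction of all attempts by a participant that succeed."""
    return sum(1 for _, _, ok in attempts if ok) / len(attempts)

def eventual_success_rate(attempts):
    """Fraction of problems whose LAST attempt succeeds."""
    last = {}  # problem -> (highest attempt index seen, outcome of that attempt)
    for problem, idx, ok in attempts:
        if problem not in last or idx > last[problem][0]:
            last[problem] = (idx, ok)
    return sum(1 for _, ok in last.values() if ok) / len(last)

def pass_at_1(sample_results):
    """Estimate pass@1 for one prompt as the fraction of resampled
    completions that pass the tests (assumed estimator: c / n)."""
    return sum(1 for ok in sample_results if ok) / len(sample_results)

attempts = [
    ("p1", 0, False), ("p1", 1, True),   # fails, then succeeds
    ("p2", 0, False), ("p2", 1, False),  # gives up after two failures
    ("p3", 0, True),                     # succeeds on the first try
]
print(success_rate(attempts))           # 2 of 5 attempts succeed
print(eventual_success_rate(attempts))  # 2 of 3 problems end in success
print(pass_at_1([True, False, False, True]))
```

In this toy log the eventual success rate (2 of 3 problems) exceeds the success rate (2 of 5 attempts), matching the caption's observation that retries raise eventual success above per-attempt success.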