Validating AI-Generated Code with Live Programming

Kasra Ferdowsi, Ruanqianqian Huang, Michael B. James, Nadia Polikarpova, Sorin Lerner

TL;DR

This work addresses the validation of AI-generated code by using Live Programming (LP) to lower the cost of validation by execution. The authors built Leap, a Python environment that integrates an AI code assistant with projection-based live runtime visualization, and evaluated it in a between-subjects study (N=17). Results show that LP reduces validation effort and cognitive load for certain tasks, helps prevent over- and under-reliance on AI suggestions, and helps users distinguish between multiple code alternatives. The findings suggest LP can strengthen trust calibration and decision-making in AI-assisted programming, with implications for broader AI-LP integrations and debugging workflows.

Abstract

AI-powered programming assistants are increasingly gaining popularity, with GitHub Copilot alone used by over a million developers worldwide. These tools are far from perfect, however, producing code suggestions that may be incorrect in subtle ways. As a result, developers face a new challenge: validating AI's suggestions. This paper explores whether Live Programming (LP), a continuous display of a program's runtime values, can help address this challenge. To answer this question, we built a Python editor that combines an AI-powered programming assistant with an existing LP environment. Using this environment in a between-subjects study (N=17), we found that by lowering the cost of validation by execution, LP can mitigate over- and under-reliance on AI-generated programs and reduce the cognitive load of validation for certain types of tasks.
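
To make "a continuous display of a program's runtime values" concrete, the following is a minimal sketch of validation by execution: it runs a function once and records its local variables at each executed line, which is the kind of data a live display could show next to the code. This is not Leap's or Projection Boxes' implementation; `trace_locals` and `count_pairs` are hypothetical names introduced here, and `count_pairs` merely stands in for an AI-generated suggestion.

```python
import copy
import sys


def trace_locals(func, *args):
    """Run func(*args) and snapshot its local variables before each executed line.

    Illustrative only: a real live-programming environment re-runs on every edit
    and renders the values next to the code instead of returning a list.
    """
    snapshots = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only trace the function under inspection, not anything it calls.
        if frame.f_code is not code:
            return None
        if event == "line":
            line_offset = frame.f_lineno - code.co_firstlineno
            # Deep-copy so later mutations do not rewrite earlier snapshots.
            snapshots.append((line_offset, copy.deepcopy(dict(frame.f_locals))))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, snapshots


def count_pairs(words):
    # Hypothetical stand-in for an AI suggestion: count adjacent word pairs.
    counts = {}
    for a, b in zip(words, words[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


result, snapshots = trace_locals(count_pairs, ["a", "b", "a", "b"])
for line_offset, local_vars in snapshots:
    print(line_offset, local_vars)
```

Reading the intermediate values of `counts` line by line, rather than only the final return value, is what lets a reader spot a subtly wrong suggestion without writing a separate test harness.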

Paper Structure (31 sections, 4 figures)

Figures (4)

  • Figure 1: Leap is a Python environment that enables validating AI-generated code suggestions via Live Programming. Users prompt the AI assistant via comments and/or code context. The Suggestion Panel shows the AI-generated suggestions. Pressing a Preview button inserts the suggestion into the editor. Users can inspect the runtime behavior of the suggestion in Projection Boxes [lerner2020pb], which are updated continuously as the user edits the code.
  • Figure 2: Success in validating AI suggestions across groups for Fixed-Prompt tasks. "Completed" means the participant submitted a solution they were satisfied with by the time limit, and "Timeout" means they did not. We deem the validation successful if a participant submitted a correct solution (dark blue) or timed out when attempting to fix the correctly identified bugs in their chosen suggestion (light blue).
  • Figure 3: Percentage of time spent in the Suggestion Panel across the two groups for Fixed-Prompt tasks.
  • Figure 4: NASA Task Load Index (TLX) results for the Fixed-Prompt tasks: Bigram on the left and Pandas on the right. Higher scores indicate higher cognitive load (in the case of Performance, a higher score means a higher failure rate).