Validating AI-Generated Code with Live Programming

Kasra Ferdowsi; Ruanqianqian Huang; Michael B. James; Nadia Polikarpova; Sorin Lerner

TL;DR

This work addresses the problem of validating AI-generated code by using Live Programming (LP) to lower the cost of validation by execution. The authors built Leap, a Python environment that integrates an AI code assistant with projection-based live runtime visualization, and evaluated it in a between-subjects study (N=17). The results show that LP reduces validation effort and cognitive load for certain tasks, helps prevent over- and under-reliance on AI suggestions, and helps users distinguish between multiple code alternatives. The findings suggest that LP can improve trust calibration and decision-making in AI-assisted programming, with implications for broader AI-LP integrations and debugging workflows.

Abstract

AI-powered programming assistants are increasingly gaining popularity, with GitHub Copilot alone used by over a million developers worldwide. These tools are far from perfect, however, producing code suggestions that may be incorrect in subtle ways. As a result, developers face a new challenge: validating AI's suggestions. This paper explores whether Live Programming (LP), a continuous display of a program's runtime values, can help address this challenge. To answer this question, we built a Python editor that combines an AI-powered programming assistant with an existing LP environment. Using this environment in a between-subjects study (N=17), we found that by lowering the cost of validation by execution, LP can mitigate over- and under-reliance on AI-generated programs and reduce the cognitive load of validation for certain types of tasks.

Paper Structure (31 sections, 4 figures)

Introduction
Related Work
Validation of AI-Generated Code
Validation in Program Synthesis
Live Programming
LEAP: The Tool Used in the Study
Example Usage
Implementation
User Study
Tasks
Participants and Groups
Procedure and Data
Results
RQ1: Over- And Under-Reliance on AI
We found six instances of unsuccessful validation, all from the No-LP group
...and 16 more sections

Figures (4)

Figure 1: Leap is a Python environment that enables validating AI-generated code suggestions via Live Programming. Users prompt the AI assistant via comments and/or code context. The Suggestion Panel shows the AI-generated suggestions. Pressing a Preview button inserts the suggestion into the editor. Users can inspect the runtime behavior of the suggestion in Projection Boxes [lerner2020pb], which are updated continuously as the user edits the code.
Figure 2: Success in validating AI suggestions across groups for Fixed-Prompt tasks. "Completed" means the participant submitted a solution they were satisfied with by the time limit, and "Timeout" means they did not. We deem the validation successful if a participant submitted a correct solution (dark blue) or timed out when attempting to fix the correctly identified bugs in their chosen suggestion (light blue).
Figure 3: Percentage of time spent in the Suggestion Panel across the two groups for Fixed-Prompt tasks.
Figure 4: NASA Task Load Index (TLX) results for the Fixed-Prompt tasks: Bigram on the left, and Pandas on the right. Higher scores indicate higher cognitive load (for the Performance scale, a higher score means a higher failure rate).
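
To make the idea of "a continuous display of a program's runtime values" concrete, the following is a minimal sketch of how per-line variable snapshots might be collected in Python. It is not Leap's or Projection Boxes' implementation, which the summary above does not describe; it assumes the standard `sys.settrace` hook, and the names `trace_locals` and `bigrams` are hypothetical, chosen only for illustration (the `bigrams` function is not the study's Bigram task).

```python
import sys


def trace_locals(func, *args):
    """Run func(*args); record the local variables as they stand after each line.

    Returns (result, snapshots), where each snapshot is a pair
    (line offset from the `def` line, {variable name: repr of its value}).
    Illustrative sketch only; not how Leap collects runtime values.
    """
    code = func.__code__
    snapshots = []
    state = {"prev": None}  # offset of the line that is currently executing

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace lines inside other functions
        # A "line" event fires *before* a line runs, so when the next event
        # arrives, the locals reflect the effect of the previous line.
        if event in ("line", "return") and state["prev"] is not None:
            values = {name: repr(value) for name, value in frame.f_locals.items()}
            snapshots.append((state["prev"], values))
        if event == "line":
            state["prev"] = frame.f_lineno - code.co_firstlineno
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return result, snapshots


def bigrams(words):
    pairs = []
    for i in range(len(words) - 1):
        pairs.append((words[i], words[i + 1]))
    return pairs


if __name__ == "__main__":
    result, snapshots = trace_locals(bigrams, ["a", "b", "c"])
    for offset, values in snapshots:
        print(offset, values)
```

Values are stored as `repr` strings so that later mutation of a list does not change earlier snapshots. A live environment would rerun such a trace on every edit and render the snapshots next to the corresponding lines, which is the kind of display the Figure 1 caption describes.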
