Validating AI-Generated Code with Live Programming
Kasra Ferdowsi, Ruanqianqian Huang, Michael B. James, Nadia Polikarpova, Sorin Lerner
TL;DR
This work addresses the problem of validating AI-generated code by using Live Programming (LP) to lower the cost of validation by execution. The authors built Leap, a Python environment that integrates an AI code assistant with projection-based live runtime visualization, and evaluated it in a between-subjects study (N=17). The results show that LP reduces validation effort and cognitive load for certain types of tasks, helps prevent both over- and under-reliance on AI suggestions, and helps users distinguish between multiple code alternatives. The findings suggest that LP can improve trust calibration and decision-making in AI-assisted programming, with implications for broader AI-LP integrations and debugging workflows.
Abstract
AI-powered programming assistants are gaining popularity, with GitHub Copilot alone used by over a million developers worldwide. These tools are far from perfect, however, producing code suggestions that may be incorrect in subtle ways. As a result, developers face a new challenge: validating the AI's suggestions. This paper explores whether Live Programming (LP), a continuous display of a program's runtime values, can help address this challenge. To answer this question, we built a Python editor that combines an AI-powered programming assistant with an existing LP environment. Using this environment in a between-subjects study (N=17), we found that by lowering the cost of validation by execution, LP can mitigate over- and under-reliance on AI-generated programs and reduce the cognitive load of validation for certain types of tasks.
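To make the idea of "a continuous display of a program's runtime values" concrete, the following is a minimal sketch of validation by execution. It is not Leap's implementation. It assumes only the standard `sys.settrace` hook, and the names `trace_locals` and `suggested_running_max` are hypothetical, introduced purely for illustration. The second function stands in for an AI suggestion with a subtle bug.

```python
import sys
from collections import defaultdict


def trace_locals(func, *args):
    """Run func(*args) and record its local variables each time a line is reached.

    Returns (result, snapshots), where snapshots maps a line offset inside
    func (1 = first body line) to the list of local-variable dicts captured
    just before that line executed. Illustrative helper, not part of Leap.
    """
    snapshots = defaultdict(list)
    code = func.__code__

    def tracer(frame, event, arg):
        # Only follow the function under inspection; ignore any other frames.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            snapshots[offset].append(dict(frame.f_locals))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active
    return result, dict(snapshots)


def suggested_running_max(values):
    # Hypothetical AI suggestion: looks plausible, but starting from 0
    # is wrong when every input value is negative.
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


result, snapshots = trace_locals(suggested_running_max, [-3, -1, -2])
print("result:", result)
for offset in sorted(snapshots):
    for state in snapshots[offset]:
        print(f"line +{offset}: {state}")
```

For the all-negative input, the recorded values show `best` staying at 0 on every iteration and the function returning 0 rather than -1. This is the kind of subtle error that is easy to miss when reading the code but easy to spot once the runtime values are visible.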
