AI-assisted coding: Experiments with GPT-4
Russell A Poldrack, Thomas Lu, Gašper Beguš
TL;DR
The paper investigates GPT-4's ability to assist with coding on data-science tasks in three settings: interactive coding, code refactoring, and test generation. It finds that GPT-4 can produce usable code and improve code quality, but that its output often requires substantial human validation and debugging, especially for domain-specific calculations. Refactoring with GPT-4 reduces lint warnings and improves maintainability metrics. Automatically generated tests achieve high coverage, but many of them fail, and a human must then diagnose whether the fault lies in the test or in the code. The work highlights the complementary role of a human in the loop in AI-assisted coding and provides a reproducible workflow for future exploration.
Abstract
Artificial intelligence (AI) tools based on large language models have achieved human-level performance on some computer programming tasks. We report several experiments using GPT-4 to generate computer code. These experiments demonstrate that AI code generation using the current generation of tools, while powerful, requires substantial human validation to ensure accurate performance. We also demonstrate that GPT-4 refactoring of existing code can significantly improve that code along several established metrics for code quality, and we show that GPT-4 can generate tests with substantial coverage, but that many of the tests fail when applied to the associated code. These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.
