AI-assisted coding: Experiments with GPT-4

Russell A Poldrack; Thomas Lu; Gašper Beguš

TL;DR

The paper investigates how well GPT-4 can assist with coding for data-science tasks in three settings: interactive coding, code refactoring, and test generation. It finds that GPT-4 can produce usable code and improve code quality, but that its output often requires substantial human validation and debugging, especially for domain-specific calculations. Refactoring with GPT-4 reduces lint warnings and improves maintainability metrics. Automatic test generation yields high coverage, but many of the generated tests fail, and a human must then determine whether the fault lies in the test or in the code. The work highlights the complementary role of humans in the loop in AI-assisted coding and provides a reproducible workflow for future exploration.

Abstract

Artificial intelligence (AI) tools based on large language models have achieved human-level performance on some computer programming tasks. We report several experiments using GPT-4 to generate computer code. These experiments demonstrate that AI code generation using the current generation of tools, while powerful, requires substantial human validation to ensure accurate performance. We also demonstrate that GPT-4 refactoring of existing code can significantly improve that code along several established metrics for code quality, and we show that GPT-4 can generate tests with substantial coverage, but that many of the tests fail when applied to the associated code. These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.

Paper Structure (7 sections, 10 figures)

Introduction
Coding with GPT-4
Refactoring code using GPT-4
Automatically generated code and tests
Conclusions
Acknowledgments
Appendix 1: Prompts used for interactive coding with GPT-4

Figures (10)

Figure 1: Proportion of successful code outcomes as a function of number of prompts. NS: not successful.
Figure 2: Number of Flake8 messages (per line of code) for original github files and refactored files.
Figure 3: Prevalence of individual Flake8 warning/error codes for original github files and refactored files. Values are sorted by prevalence in the original github files.
Figure 4: Logical lines of code
Figure 5: Comments
...and 5 more figures
