AI-assisted coding: Experiments with GPT-4

Russell A Poldrack, Thomas Lu, Gašper Beguš

TL;DR

The paper investigates how well GPT-4 can assist with data-science coding in three settings: interactive coding, code refactoring, and test generation. GPT-4 can produce usable code and improve code quality, but its output often requires substantial human validation and debugging, especially for domain-specific calculations. Refactoring with GPT-4 reduces lint warnings and improves maintainability metrics. Automatically generated tests achieve high coverage, but many of them fail, and a human must then determine whether the fault lies in the test or in the code under test. The work highlights the complementary role of the human in the loop in AI-assisted coding and provides a reproducible workflow for future exploration.

Abstract

Artificial intelligence (AI) tools based on large language models have achieved human-level performance on some computer programming tasks. We report several experiments using GPT-4 to generate computer code. These experiments demonstrate that AI code generation using the current generation of tools, while powerful, requires substantial human validation to ensure accurate performance. We also demonstrate that GPT-4 refactoring of existing code can significantly improve that code along several established metrics for code quality, and we show that GPT-4 can generate tests with substantial coverage, but that many of the tests fail when applied to the associated code. These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.

Paper Structure (7 sections, 10 figures)

Figures (10)

  • Figure 1: Proportion of successful code outcomes as a function of number of prompts. NS: not successful.
  • Figure 2: Number of Flake8 messages (per line of code) for original GitHub files and refactored files.
  • Figure 3: Prevalence of individual Flake8 warning/error codes for original GitHub files and refactored files. Values are sorted by prevalence in the original GitHub files.
  • Figure 4: Logical lines of code
  • Figure 5: Comments
  • ...and 5 more figures
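
The "Flake8 messages per line of code" measure named in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `messages_per_line` is hypothetical, the sketch assumes Flake8's default `path:row:col: CODE message` output format, and it normalizes by physical source lines, which may differ from the normalization used in the paper.

```python
import re

# Assumption: Flake8's default output format, "path:row:col: CODE message".
FLAKE8_LINE = re.compile(
    r"^(?P<path>.+?):(?P<row>\d+):(?P<col>\d+): (?P<code>[A-Z]+\d+) "
)


def messages_per_line(flake8_output: str, source: str) -> float:
    """Return the number of Flake8 messages divided by the number of source lines.

    Hypothetical helper: counts physical lines, not logical lines of code.
    """
    n_messages = sum(
        1 for line in flake8_output.splitlines() if FLAKE8_LINE.match(line)
    )
    n_lines = len(source.splitlines())
    return n_messages / n_lines if n_lines else 0.0


# Illustrative input only; these are not data from the paper.
source = "import os\nx=1\nprint(x)\n"
report = (
    "example.py:1:1: F401 'os' imported but unused\n"
    "example.py:2:2: E225 missing whitespace around operator\n"
)
rate = messages_per_line(report, source)  # 2 messages over 3 lines
```

Comparing this rate for a file before and after refactoring gives one per-file data point of the kind the caption describes.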