On Iterative Evaluation and Enhancement of Code Quality Using GPT-4o

Rundong Liu, Andre Frade, Amal Vaidya, Maxime Labonne, Marcus Kaiser, Bismayan Chakrabarti, Jonathan Budd, Sean Moran

TL;DR

CodeQUEST presents a GPT-4o-powered framework that automatically evaluates and iteratively improves code quality across ten dimensions. It couples an Evaluator that produces dimension-wise scores and qualitative summaries with an Optimizer that enacts improvements in iterative cycles guided by the Evaluator. Empirical results on Python and JavaScript samples show strong alignment with static-analysis proxies and a mean relative improvement of $52.6\% \pm 17.9\%$ across instances, with most gains realized in the first iteration. The work demonstrates the potential of LLM-driven automation to enhance software development workflows while acknowledging variability and the need for robust validation.

Abstract

This paper introduces CodeQUEST, a novel framework leveraging Large Language Models (LLMs) to iteratively evaluate and enhance code quality across multiple dimensions, including readability, maintainability, efficiency, and security. The framework is divided into two main components: an Evaluator that assesses code quality across ten dimensions, providing both quantitative scores and qualitative summaries, and an Optimizer that iteratively improves the code based on the Evaluator's feedback. Our study demonstrates that CodeQUEST can effectively and robustly evaluate code quality, with its assessments aligning closely with established code quality metrics. Through a series of experiments using a curated dataset of Python and JavaScript examples, CodeQUEST demonstrated significant improvements in code quality, achieving a mean relative percentage improvement of 52.6%. The framework's evaluations were validated against a set of proxy metrics comprising Pylint Score, Radon Maintainability Index, and Bandit output logs, showing a meaningful correlation. This highlights the potential of LLMs in automating code quality evaluation and improvement processes, presenting a significant advancement toward enhancing software development practices. The code implementation of the framework is available at: https://github.com/jpmorganchase/CodeQuest.
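
The Evaluator/Optimizer cycle described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `improve_iteratively`, `toy_evaluate`, and `toy_optimize` are hypothetical, the toy functions stand in for the GPT-4o calls, and both the "accept a candidate only if its mean score improves" rule and the stopping criteria are assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Scores = Dict[str, float]                       # dimension name -> score
Evaluator = Callable[[str], Tuple[Scores, str]]  # code -> (scores, summary)
Optimizer = Callable[[str, str], str]            # (code, summary) -> new code


def mean_score(scores: Scores) -> float:
    """Aggregate dimension-wise scores into one overall score."""
    return sum(scores.values()) / len(scores)


def improve_iteratively(
    code: str,
    evaluate: Evaluator,
    optimize: Optimizer,
    max_iterations: int = 3,
    target: Optional[float] = None,
) -> Tuple[str, List[float]]:
    """Alternate evaluation and optimization; return best code and score history."""
    scores, summary = evaluate(code)
    history = [mean_score(scores)]
    best_code = code
    for _ in range(max_iterations):
        if target is not None and history[-1] >= target:
            break  # assumed stopping rule: target quality reached
        candidate = optimize(best_code, summary)
        cand_scores, cand_summary = evaluate(candidate)
        cand_mean = mean_score(cand_scores)
        if cand_mean <= history[-1]:
            break  # assumed rule: reject candidates that do not improve
        best_code, summary = candidate, cand_summary
        history.append(cand_mean)
    return best_code, history


# Hypothetical stand-ins for the LLM-backed components.
def toy_evaluate(code: str) -> Tuple[Scores, str]:
    n = code.count("TODO")
    return {"readability": float(-n)}, f"{n} TODO markers remain"


def toy_optimize(code: str, summary: str) -> str:
    return code.replace("TODO", "", 1)  # "fix" one issue per iteration


if __name__ == "__main__":
    best, history = improve_iteratively(
        "TODO TODO x", toy_evaluate, toy_optimize, max_iterations=5
    )
    print(history)  # [-2.0, -1.0, 0.0]
```

In a real setting the two callables would wrap LLM prompts, and the evaluation step might be repeated and averaged to reduce variance in the scores.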

Paper Structure

This paper contains 24 sections, 2 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Schematic representation of CodeQUEST.
  • Figure 2: Comparison of the Baseline and CodeQUEST code quality scores. Each code example was evaluated five times. The mean and standard deviation were calculated across these five samples for the baseline (top) and CodeQUEST (bottom).
  • Figure 3: Quality score evolution of a Python code example (mbpp/927.py) when subjected to the Optimizer. The quality score is provided in its overall and dimension-specific format. Corresponding quality summaries and code snippets can be found in the Appendix ("CodeQUEST - code evolution example").
  • Figure 4: Distribution of code quality score improvements per iteration. Self-consistency was disabled in this experiment.
  • Figure 5: Scatter plots and linear regression lines for $\{\Delta_{x_n}\text{Proxy}, \Delta_{x_n}\text{CodeQUEST}\}_{x_n}$ (left) and $\{\Delta_{x_n}\text{Proxy}, \Delta_{x_n}\text{Baseline}\}_{x_n}$ (right). The shaded area represents the uncertainty of the linear regression fit.