Better Python Programming for all: With the focus on Maintainability

Karthik Shivashankar; Antonio Martini

TL;DR

This work addresses the maintainability gap in Python code generated by Code LLMs by fine-tuning models on a dedicated maintainability-focused dataset. It combines instruction tuning and parameter-efficient fine-tuning (PEFT) with QLoRA, applied to both an open and a closed model (WizardCoder 13B and GPT-3.5), to generate refactored code that preserves functionality while reducing size and complexity, as measured by source lines of code ($SLOC$), cyclomatic complexity ($CC$), Halstead effort ($HE$), and maintainability index ($MI$). Evaluation uses CodeBERTScore to assess functional similarity and Radon-based metrics to assess maintainability, supplemented by human judgments from 11 Python-expert participants. The results show measurable improvements in the maintainability metrics and high functional similarity between generated and reference code, indicating that targeted maintainability objectives can be integrated into AI-assisted code generation. A publicly released replication package and dataset support researchers and practitioners who aim to reduce technical debt and sustain software development with AI-assisted tooling.
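To make the size and complexity metrics concrete, the following is a minimal sketch that approximates $SLOC$ and $CC$ using only the Python standard library. It is not the paper's evaluation pipeline and not Radon's implementation: the counting rules are simplified assumptions, and the helper names (`sloc`, `cyclomatic_complexity`) and the two `evens` snippets are hypothetical illustrations.

```python
import ast


def sloc(source: str) -> int:
    """Rough SLOC proxy: non-blank lines that are not pure comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


# Statement/expression types that each add one decision point.
# This is a simplified McCabe-style rule set (an assumption of this sketch).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity of a whole snippet."""
    complexity = 1  # one path through straight-line code
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.comprehension):
            # One for the implicit loop, one per filtering `if` clause.
            complexity += 1 + len(node.ifs)
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` contributes two extra branches.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical "before" and "after" snippets with the same behaviour.
ORIGINAL = '''
def evens(numbers):
    result = []
    for n in numbers:
        if n % 2 == 0:
            result.append(n)
    return result
'''

REFACTORED = '''
def evens(numbers):
    return [n for n in numbers if n % 2 == 0]
'''

if __name__ == "__main__":
    for label, code in (("original", ORIGINAL), ("refactored", REFACTORED)):
        print(label, "SLOC:", sloc(code), "CC:", cyclomatic_complexity(code))
```

In this toy case the refactoring shrinks the line count while the branching structure stays the same, which illustrates why size and complexity are reported as separate metrics rather than folded into one number.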

Abstract

This study aims to enhance the maintainability of code generated by Large Language Models (LLMs), with a focus on the Python programming language. As the use of LLMs for coding assistance grows, so do concerns about the maintainability of the code they produce. Previous research has mainly concentrated on the functional accuracy and testing success of generated code, overlooking aspects of maintainability. Our approach involves the use of a specifically designed dataset for training and evaluating the model, ensuring a thorough assessment of code maintainability. At the heart of our work is the fine-tuning of an LLM for code refactoring, aimed at enhancing code readability, reducing complexity, and improving overall maintainability. After fine-tuning an LLM to prioritize code maintainability, our evaluations indicate that this model significantly improves code maintainability standards, suggesting a promising direction for the future of AI-assisted software development.

Paper Structure (30 sections, 5 figures, 11 tables)

Introduction
Research Questions
Motivation
Our contributions are twofold:
Background
Maintainability
Parameter-Efficient Fine-Tuning (PEFT)
Related Works
Methodology
Selecting Dataset for Maintainability
Diversity of data and rationale for choosing the dataset
Instructing GPT-4 to generate maintainable code
Rationale for using the GPT-4 model to generate a training dataset for maintainability
Evaluating the functional similarity of the generated GPT-4 code to the original code
Enhancing the Dataset with a Maintainability Metric
...and 15 more sections

Figures (5)

Figure 1: Steps used for Curating Datasets and Fine-tuning LLM
Figure 2: Comparing the distribution of metrics for the CodeAlpaca test split with WizardCoder 13B (blue: fine-tuned model, orange: dataset, green: base model)
Figure 3: Comparing the distribution of metrics for the CommitPackFT test split with WizardCoder 13B
Figure 4: Comparing the distribution of metrics for the CodeAlpaca test split with GPT-3.5
Figure 5: Comparing the distribution of metrics for the CommitPackFT test split with GPT-3.5
