Evaluation of large language models for assessing code maintainability

Marc Dillmann; Julien Siebert; Adam Trendowicz

TL;DR

This study investigates whether cross-entropy between LLM-generated code and actual Java code can indicate maintainability at the class level. Using 10 pretrained models (GPT-2 and Llama-2 based) and Schnappinger et al.’s expert-rated dataset, it computes cross-entropy over code chunks and measures logical lines of code (LLOC) to predict five maintainability dimensions. The main finding is that cross-entropy predicts maintainability only after controlling for LLOC; without this control, associations can invert, and model size has little impact. The results suggest LLM-derived cross-entropy can serve as a rough oracle for subjective code quality, but its utility depends on controlling for code size and likely requires integration with additional metrics for practical use.

Abstract

Increased availability of open-source software repositories and recent advances in code analysis using large language models (LLMs) have triggered a wave of new work to automate software engineering tasks that were previously very difficult to automate. In this paper, we investigate a recent line of work that hypothesises that comparing the probability of code generated by LLMs with the probability the current code would have had can indicate potential quality problems. We investigate the association between the cross-entropy of code generated by ten different models (based on GPT2 and Llama2) and the following quality aspects: readability, understandability, complexity, modularisation, and overall maintainability, as assessed by experts and available in a benchmark dataset. Our results show that, controlling for the number of logical lines of code (LLOC), cross-entropy computed by LLMs is indeed a predictor of maintainability at the class level (the higher the cross-entropy, the lower the maintainability). However, this relation is reversed when one does not control for LLOC (e.g., comparing small classes with longer ones). Furthermore, while the complexity of LLMs affects the range of cross-entropy (smaller models tend to have a wider range of cross-entropy), this does not play a significant role in predicting maintainability aspects. Our study is limited to ten different pretrained models (based on GPT2 and Llama2) and to the maintainability aspects collected by Schnappinger et al. In conclusion, when controlling for logical lines of code (LLOC), cross-entropy is a predictor of maintainability. However, while related work has shown the potential usefulness of cross-entropy at the level of tokens or short sequences, at the class level this criterion alone may prove insufficient to predict maintainability, and further research is needed to make best use of this information in practice.
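
To make the central quantity concrete, the following is a minimal sketch of how a class-level cross-entropy could be computed from per-token probabilities. It assumes the probabilities a model assigns to each observed token are already available; the function names, the fixed-size chunking, and the token-weighted aggregation across chunks are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import List, Sequence


def token_cross_entropy(token_probs: Sequence[float]) -> float:
    """Mean negative log-probability (in nats) of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def chunked(items: Sequence[float], chunk_size: int) -> List[Sequence[float]]:
    """Split a sequence into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def class_cross_entropy(token_probs: Sequence[float], chunk_size: int) -> float:
    """Token-weighted mean of per-chunk cross-entropies.

    Assumption: chunks are aggregated by weighting each chunk with its
    token count; other aggregation schemes are possible.
    """
    chunks = chunked(token_probs, chunk_size)
    weighted_sum = sum(token_cross_entropy(c) * len(c) for c in chunks)
    return weighted_sum / len(token_probs)
```

Under this definition, a class whose tokens the model finds predictable (probabilities close to 1) gets a cross-entropy close to 0, and a class full of tokens the model finds surprising gets a high value.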

Paper Structure (8 sections, 4 figures, 8 tables)

Introduction
Related Work
Method
Results
Comparison with other related work
Discussion
Threats to validity
Conclusion

Figures (4)

Figure 1: Overview of the experimental design
Figure 2: Visualization of the relationship (a) between overall maintainability (Ov.) and cross-entropy (as computed by the model bloomz-1b1 (M2)); (b) between overall maintainability (Ov.) and LLOC.
Figure 3: Association between cross-entropy (measured by the 10 models) and the probability that experts would answer strongly agree for Ov., Rd., and Ud. and between cross-entropy and strongly disagree for Cx. and Md. (a) shows the association without stratification. (b) when stratifying by LLOC. Statistics for LLOC: min: 4, Q1: 17.0, Q2 (median): 56.5, Q3: 153.5, max: 1627.
Figure 4: Visualisation of the feature space consisting of cross-entropy (x-axis, log scale) and LLOC (y-axis, log scale); The colour scale represents the probability associated with the class (i.e. P(strongly agree) for Ov., Rd. and Ud. and P(strongly disagree) for Cx. and Md.).
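
The stratification by LLOC referred to in Figure 3 can be illustrated with a short sketch. It assumes that the quartile boundaries listed in the Figure 3 caption (Q1 = 17.0, median = 56.5, Q3 = 153.5) serve as stratum limits, which the caption suggests but does not state, and it swaps in a plain Pearson correlation as a stand-in for the paper's statistical model, which is not described in this section. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Tuple

# Quartile boundaries for LLOC as reported in the Figure 3 caption.
LLOC_QUARTILES = (17.0, 56.5, 153.5)


def lloc_stratum(lloc: float) -> int:
    """Return 0..3, the index of the quartile bin a class size falls into."""
    return sum(lloc > q for q in LLOC_QUARTILES)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns NaN when either variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def stratified_correlation(
    records: Iterable[Tuple[float, float, float]]
) -> Dict[int, float]:
    """Correlate cross-entropy with rating separately within each LLOC stratum.

    Each record is (lloc, cross_entropy, rating), where a higher rating is
    assumed to mean better maintainability. Strata with fewer than two
    records are skipped.
    """
    groups = defaultdict(list)
    for lloc, cross_entropy, rating in records:
        groups[lloc_stratum(lloc)].append((cross_entropy, rating))
    result = {}
    for stratum, pairs in sorted(groups.items()):
        if len(pairs) < 2:
            continue
        entropies, ratings = zip(*pairs)
        result[stratum] = pearson(entropies, ratings)
    return result
```

Comparing the per-stratum values with a single correlation over all records pooled together is one way to see how an association can change sign once class size is held roughly constant.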
