Ghost Echoes Revealed: Benchmarking Maintainability Metrics and Machine Learning Predictions Against Human Assessments

Markus Borg, Marwa Ezzouhri, Adam Tornhill

TL;DR

The study benchmarks maintainability prediction methods by comparing SotA ML models against industry tools (CodeScene Code Health, Microsoft's Maintainability Index (MS-MI), and SonarQube) and a lines-of-code (LoC) baseline, using the MainData ground-truth dataset. Code Health matches SotA ML in accuracy and outperforms the average human expert, while also providing actionable code-smell remediation guidance; SonarQube exhibits many false positives and weaker predictive power. Across the underlying metrics, SotA ML leads in AUC ($0.97$), with Code Health close behind ($0.95$), whereas SonarQube's TD Time is moderate and its TD Ratio is notably poor (AUC $= 0.60$). The authors advocate adopting Code Health for reliable, actionable maintainability assessments and caution against relying on SonarQube alone, calling for more robust benchmarks and a reevaluation of past SonarQube-based ground truths.
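
To make the AUC figures above concrete, the following is a minimal sketch of how an AUC can be computed for any per-file score against binary human labels. It uses the rank-based definition (the probability that a randomly chosen positive file scores higher than a randomly chosen negative one). The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the paper or from MainData.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative).

    Ties between a positive and a negative count as half a win.
    labels: 1 = flagged by the human experts, 0 = not flagged (assumed encoding).
    scores: higher value = predictor considers the file more problematic.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 flagged files, 5 unflagged files.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

Under this reading, an AUC near 0.5 means the score ranks files little better than chance, which is why a value of 0.60 counts as poor next to 0.95 or 0.97.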

Abstract

As generative AI is expected to increase global code volumes, the importance of maintainability from a human perspective will become even greater. Various methods have been developed to identify the most important maintainability issues, including aggregated metrics and advanced Machine Learning (ML) models. This study benchmarks several maintainability prediction approaches, including State-of-the-Art (SotA) ML, SonarQube's Maintainability Rating, CodeScene's Code Health, and Microsoft's Maintainability Index. Our results indicate that CodeScene matches the accuracy of SotA ML and outperforms the average human expert. Importantly, unlike SotA ML, CodeScene also provides end users with actionable code smell details to remedy identified issues. Finally, caution is advised with SonarQube due to its tendency to generate many false positives. Unfortunately, our findings call into question the validity of previous studies that solely relied on SonarQube output for establishing ground truth labels. To improve reliability in future maintainability and technical debt studies, we recommend employing more accurate metrics. Moreover, reevaluating previous findings with Code Health would mitigate this revealed validity threat.

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, and 1 table.

Figures (4)

  • Figure 1: ROC curves for UC1 Maintainability Prediction.
  • Figure 2: ROC curves for UC2 Liability Prediction.
  • Figure 3: SVGFEFuncBElement.java, a very small file in JSweet. SonarQube yields a false positive, whereas all other maintainability prediction approaches consider the file maintainable. The five lines of code contain three SonarQube code smells (expanded in pink), which translate into a TD Time of 35 min and a TD Ratio of 0.233, resulting in Maintainability Rating D.
  • Figure 4: SVGMaskElement.java, a small file in JSweet with 31 SonarQube code smells. The file has a TD Time of 281 min and a TD Ratio of 0.335, resulting in Maintainability Rating D. According to SonarQube, this is the least maintainable file in the entire MainData. All non-SonarQube prediction approaches consider the file maintainable.
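
The numbers in the Figure 3 caption can be reproduced with a short sketch. It assumes SonarQube's default settings as commonly documented: a development cost of 30 minutes per line of code, and rating thresholds of 5%, 10%, 20%, and 50% for grades A to D. Neither value is stated in the text above, so treat both as assumptions. The function names are hypothetical.

```python
# Assumption: SonarQube's default development cost per line of code.
MINUTES_PER_LINE = 30

# Assumption: default upper bounds of the TD Ratio for each rating.
RATING_GRID = [(0.05, "A"), (0.10, "B"), (0.20, "C"), (0.50, "D")]


def td_ratio(td_minutes, lines_of_code, minutes_per_line=MINUTES_PER_LINE):
    """TD Ratio = remediation time / estimated time to develop the file."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return td_minutes / (lines_of_code * minutes_per_line)


def maintainability_rating(ratio):
    """Map a TD Ratio to a letter grade; anything above the grid is E."""
    for upper_bound, grade in RATING_GRID:
        if ratio <= upper_bound:
            return grade
    return "E"


# Figure 3: five lines of code, TD Time of 35 min.
fig3_ratio = td_ratio(td_minutes=35, lines_of_code=5)
fig3_rating = maintainability_rating(fig3_ratio)
print(round(fig3_ratio, 3), fig3_rating)

# Figure 4: the caption gives the TD Ratio directly (0.335).
fig4_rating = maintainability_rating(0.335)
print(fig4_rating)
```

Because the denominator scales with file length, a handful of smells in a five-line file is enough to push the ratio past the 20% boundary, which is consistent with the false positive described for Figure 3.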