From Code Changes to Quality Gains: An Empirical Study in Python ML Systems with PyQu

Mohamed Almukhtar, Anwar Ghammam, Marouane Kessentini, Hua Ming

TL;DR

This work tackles the challenge of linking code changes to software quality in Python-based machine learning systems by introducing PyQu, a metric-driven tool that uses low-level software metrics and ML classifiers to identify quality-enhancing commits. Through a large-scale study of 3,340 open-source MLS projects, the authors extract, label, and validate quality changes, identifying 61 change types organized into 13 categories, including 25 novel changes not captured by prior tools. PyQu demonstrates strong effectiveness and generalizability, achieving average accuracy around 0.84–0.87 and ROC-AUC up to 0.91 across five quality attributes (Understandability, Reliability, Maintainability, Usability, Modularity), and identifies 2,338 quality-enhancing edits, only partially overlapping with existing detectors. The work provides a practical resource (the MLCodeQuality benchmark) and a comprehensive taxonomy to guide automated quality assessment and best-practice code changes in MLS, with implications for researchers, practitioners, and IDE/tool developers.

Abstract

In an era shaped by Generative Artificial Intelligence for code generation and the rising adoption of Python-based Machine Learning systems (MLS), software quality has emerged as a major concern. As these systems grow in complexity and importance, a key obstacle lies in understanding exactly how specific code changes affect overall quality, a shortfall aggravated by the lack of quality assessment tools and of a clear mapping between MLS code changes and their quality effects. Although prior work has explored code changes in MLS, it mostly stops at what the changes are, leaving a gap in our knowledge of the relationship between code changes and MLS quality. To address this gap, we conducted a large-scale empirical study of 3,340 open-source Python ML projects, encompassing more than 3.7 million commits and 2.7 trillion lines of code. We introduce PyQu, a novel tool that leverages low-level software metrics to identify quality-enhancing commits with an average accuracy, precision, and recall of 0.84 and an average F1 score of 0.85. Using PyQu and a thematic analysis, we identified 61 code changes, each demonstrating a direct impact on enhancing software quality, and we classified them into 13 categories based on contextual characteristics. 41% of the changes are newly discovered by our study and have not been identified by state-of-the-art Python change detection tools. Our work offers a vital foundation for researchers, practitioners, educators, and tool developers, advancing the quest for automated quality assessment and best practices in Python-based ML software.
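
To make the idea of a metric-driven view of a commit concrete, the sketch below computes per-metric differences between the code before and after a change. It is only an illustration under stated assumptions: the `MetricSnapshot` type, the three metric names, and the sample values are hypothetical and are not taken from PyQu, whose actual metric set and classifiers are not described in this summary.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class MetricSnapshot:
    """Hypothetical low-level metrics for a file at one commit (not PyQu's metric set)."""
    lines_of_code: int
    cyclomatic_complexity: int
    comment_lines: int


def metric_deltas(before: MetricSnapshot, after: MetricSnapshot) -> dict[str, int]:
    """Return after-minus-before for every metric; negative means the value dropped."""
    before_values = asdict(before)
    after_values = asdict(after)
    return {name: after_values[name] - before_values[name] for name in before_values}


# Made-up values for a single commit, used only to show the shape of the output.
before = MetricSnapshot(lines_of_code=120, cyclomatic_complexity=14, comment_lines=10)
after = MetricSnapshot(lines_of_code=110, cyclomatic_complexity=9, comment_lines=18)

features = metric_deltas(before, after)
print(features)
```

A delta vector of this kind is one plausible input to a classifier that labels a commit as quality-enhancing or not for a given attribute; whether PyQu builds its features this way is an assumption here.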

Paper Structure

This paper contains 43 sections, 5 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Research Methodology Overview.