
Empirical Analysis on CI/CD Pipeline Evolution in Machine Learning Projects

Dhia Elhaq Rzig, Alaa Houerbi, Rahul Ghanshyam Chavan, Foyzul Hassan

TL;DR

This paper delivers the first empirical analysis of how CI/CD configurations evolve in ML projects and how these changes co-evolve with ML code. It combines manual labeling, association-rule mining, and AST-based change mining on a dataset of 508 ML projects (343 manually analyzed commits, 15,634 CI changes), revealing a 14-category co-change taxonomy and prominent patterns around build policy and testing. Key findings include a strong association between developer experience and CI/CD modification activity, prevalent bad practices such as embedding dependencies in CI files and limited use of automated test discovery, and actionable change patterns across CI/CD lifecycle phases. The study offers practical implications for ML practitioners, CI/CD tool builders, and researchers, and provides publicly available data and scripts to support replication and further work.

Abstract

The growing popularity of machine learning (ML) and the integration of ML components with other software artifacts have led to the use of continuous integration and delivery (CI/CD) tools, such as Travis CI and GitHub Actions, which enable faster integration and testing for ML projects. Such CI/CD configurations and services require synchronization during the life cycle of the projects. Several works have discussed how CI/CD configurations and services change during their usage in traditional software systems. However, there is very limited knowledge of how CI/CD configurations and services change in ML projects. To fill this knowledge gap, this work presents the first empirical analysis of how CI/CD configuration evolves for ML software systems. We manually analyzed 343 commits collected from 508 open-source ML projects to identify common CI/CD configuration change categories in ML projects and devised a taxonomy of 14 co-changes in CI/CD and ML components. Moreover, we developed a CI/CD configuration change clustering tool that identified frequent CI/CD configuration change patterns in 15,634 commits. Furthermore, we measured the expertise of ML developers who modify CI/CD configurations. Based on this analysis, we found that 61.8% of commits include a change to the build policy, while changes related to performance and maintainability are minimal compared to general open-source projects. Additionally, the co-evolution analysis identified that CI/CD configurations, in many cases, changed unnecessarily due to bad practices such as the direct inclusion of dependencies and a lack of usage of standardized testing frameworks. The change pattern analysis revealed further such practices, including the use of deprecated settings and reliance on a generic build language. Finally, our developer expertise analysis suggests that experienced developers are more inclined to modify CI/CD configurations.

Paper Structure (21 sections, 2 equations, 6 figures, 1 table)


Figures (6)

  • Figure 1: Travis CI job Lifecycle.
  • Figure 2: Overview of Research Approach
  • Figure 3: Distribution of CI/CD change categories.
  • Figure 4: Distribution of commit categories
  • Figure 5: Change Patterns in .travis.yml configurations lifecycle.
  • ...and 1 more figure