Systematic Literature Review on Application of Learning-based Approaches in Continuous Integration

Ali Kazemi Arani, Triet Huynh Minh Le, Mansooreh Zahedi, M. Ali Babar

TL;DR

This systematic literature review analyzes 52 studies (2000–2023) on applying learning-based methods to Continuous Integration (CI). It maps ML techniques onto six CI phases and ten automated tasks, detailing data sources, data types, feature engineering, model training, tuning, and evaluation practices. Key findings include a strong focus on Regression Testing and Build Validation, prevalent use of data/code metadata features, dominance of supervised learning with DT and RL for certain tasks, and diverse evaluation metrics (e.g., Recall, Precision, F1, APFD/NAPFD, AUC) with a trend toward RL-based approaches in Test Optimization. The study identifies gaps such as limited qualitative evaluations, underexplored CI phases, data-drift considerations, security concerns, and a need for standardized, multi-metric assessment frameworks to advance practical ML-based CI automation.
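Of the evaluation metrics named above, APFD (Average Percentage of Faults Detected) is the least self-explanatory, so a minimal sketch may help. It assumes the standard textbook definition, APFD = 1 - (TF_1 + ... + TF_m) / (n * m) + 1 / (2n), where n is the number of tests, m the number of faults, and TF_i the 1-based position of the first test that detects fault i. The function name `apfd`, the test and fault identifiers, and the data are hypothetical illustrations, not taken from the reviewed studies.

```python
from typing import Dict, List, Set


def apfd(order: List[str], faults: Dict[str, Set[str]]) -> float:
    """Compute APFD for one test ordering.

    order:  test ids in execution order (assumed unique).
    faults: fault id -> set of test ids that detect that fault.
    Assumes every fault is detected by at least one test in `order`.
    """
    n = len(order)
    m = len(faults)
    if n == 0 or m == 0:
        raise ValueError("need at least one test and one fault")

    # 1-based position of each test in the prioritized order.
    position = {test: i for i, test in enumerate(order, start=1)}

    first_positions = []
    for fault, detecting_tests in faults.items():
        hits = [position[t] for t in detecting_tests if t in position]
        if not hits:
            raise ValueError(f"fault {fault!r} is not detected by any test")
        first_positions.append(min(hits))  # TF_i for this fault

    return 1.0 - sum(first_positions) / (n * m) + 1.0 / (2 * n)


# Hypothetical data: two faults, four tests.
faults = {"f1": {"t3"}, "f2": {"t1", "t4"}}
baseline = apfd(["t1", "t2", "t3", "t4"], faults)     # 0.625
prioritized = apfd(["t3", "t1", "t2", "t4"], faults)  # 0.75
```

Moving the fault-revealing test `t3` to the front raises the score from 0.625 to 0.75, which is the behavior a test prioritization approach aims for. NAPFD generalizes this to the case where some faults go undetected; it is not covered by this sketch.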

Abstract

Context: Machine learning (ML) and deep learning (DL) analyze raw data to extract valuable insights in specific phases. The rise of continuous practices in software projects emphasizes automating Continuous Integration (CI) with these learning-based methods, while the growing adoption of such approaches underscores the need for systematizing knowledge. Objective: Our objective is to comprehensively review and analyze existing literature concerning learning-based methods within the CI domain. We endeavour to identify and analyse various techniques documented in the literature, emphasizing the fundamental attributes of training phases within learning-based solutions in the context of CI. Method: We conducted a Systematic Literature Review (SLR) involving 52 primary studies. Through statistical and thematic analyses, we explored the correlations between CI tasks and the training phases of learning-based methodologies across the selected studies, encompassing a spectrum from data engineering techniques to evaluation metrics. Results: This paper presents an analysis of the automation of CI tasks utilizing learning-based methods. We identify and analyze nine types of data sources, four steps in data preparation, four feature types, nine subsets of data features, five approaches for hyperparameter selection and tuning, and fifteen evaluation metrics. Furthermore, we discuss the latest techniques employed, existing gaps in CI task automation, and the characteristics of the utilized learning-based techniques. Conclusion: This study provides a comprehensive overview of learning-based methods in CI, offering valuable insights for researchers and practitioners developing CI task automation. It also highlights the need for further research to advance these methods in CI.

Paper Structure (33 sections, 1 equation, 6 figures, 22 tables)

Figures (6)

  • Figure 1: The four phases of the ML life cycle. Note: the steps required for training an ML model are distinguished by numbers.
  • Figure 2: Overview of the research methodology. Note: Two arrows pointing in opposite directions represent iterative actions. "I" and "E" stand for the Inclusion and Exclusion criteria, respectively, as listed in the paper's inclusion/exclusion criteria table, and the numbers in parentheses give the total number of papers selected at each step.
  • Figure 3: Number of selected studies published per year and their distribution over publication venues. Note: No paper was published between 2006 and 2014. Because the search string was run in July 2023 and snowballing was performed in August 2023, the list of papers published in 2023 is incomplete.
  • Figure 4: Word cloud of the keywords in selected primary studies in the CSE area.
  • Figure 5: Overview of connection between six CI phases and their in/output.
  • ...and 1 more figures