MBL-CPDP: A Multi-objective Bilevel Method for Cross-Project Defect Prediction via Automated Machine Learning

Jiaxin Chen, Jinliang Ding, Kay Chen Tan, Jiancheng Qian, Ke Li

TL;DR

Extensive empirical results show that MBL-CPDP outperforms the comparison methods, demonstrating its superior adaptability and comprehensive performance evaluation capability.

Abstract

Cross-project defect prediction (CPDP) leverages machine learning (ML) techniques to proactively identify software defects, especially where project-specific data is scarce. However, developing a robust ML pipeline with optimal hyperparameters that effectively uses cross-project information and yields satisfactory performance remains challenging. In this paper, we address this bottleneck by formulating CPDP as a multi-objective bilevel optimization (MBLO) problem; the resulting method is dubbed MBL-CPDP. It comprises two nested problems: the upper-level problem, a multi-objective combinatorial optimization problem, enhances robustness and efficiency in optimizing ML pipelines, while the lower-level problem is an expensive optimization problem that focuses on tuning the hyperparameters of those pipelines. Because the search space is high-dimensional and characterized by feature redundancy and inconsistent data distributions, the upper-level problem combines feature selection, transfer learning, and classification to leverage limited and heterogeneous historical data. Meanwhile, an ensemble learning method is proposed to capture differences in cross-project distribution and generalize across diverse datasets. Finally, an MBLO algorithm is presented to solve this problem effectively while achieving high adaptability. To evaluate the performance of MBL-CPDP, we compare it with five automated ML tools and 50 CPDP techniques across 20 projects. Extensive empirical results show that MBL-CPDP outperforms the comparison methods, demonstrating its superior adaptability and comprehensive performance evaluation capability.
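The nested structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the component names, the single hyperparameter, the two toy objectives, and the use of exhaustive enumeration with random search (in place of the MBLO algorithm the paper presents) are all assumptions made for illustration. The upper level picks a pipeline from a combinatorial space, the lower level tunes that pipeline's hyperparameter, and the non-dominated pipelines are kept.

```python
import itertools
import random

# Hypothetical component names; the abstract does not list the actual choices.
FEATURE_SELECTORS = ["none", "variance_filter"]
TRANSFER_METHODS = ["none", "instance_reweighting"]
CLASSIFIERS = ["logistic_regression", "random_forest"]


def toy_objectives(pipeline, hp):
    """Stand-in for training and validating a pipeline.

    Returns (error, cost), both to be minimized. The formulas are
    arbitrary placeholders, not the objectives used in the paper.
    """
    size = sum(len(name) for name in pipeline)
    best_hp = (size % 10) / 10.0
    error = (hp - best_hp) ** 2 + 1.0 / size
    cost = sum(name != "none" for name in pipeline)
    return error, cost


def lower_level(pipeline, rng, budget=20):
    """Lower level: tune one hyperparameter in [0, 1] by random search."""
    best_hp, best_obj = None, None
    for _ in range(budget):
        hp = rng.random()
        obj = toy_objectives(pipeline, hp)
        if best_obj is None or obj[0] < best_obj[0]:
            best_hp, best_obj = hp, obj
    return best_hp, best_obj


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def upper_level(seed=0):
    """Upper level: evaluate every pipeline, each with tuned hyperparameters."""
    rng = random.Random(seed)
    evaluated = []
    for pipeline in itertools.product(FEATURE_SELECTORS, TRANSFER_METHODS, CLASSIFIERS):
        hp, obj = lower_level(pipeline, rng)
        evaluated.append((pipeline, hp, obj))
    # Keep only pipelines that no other evaluated pipeline dominates.
    front = [e for e in evaluated
             if not any(dominates(other[2], e[2]) for other in evaluated)]
    return evaluated, front


if __name__ == "__main__":
    _, pareto_front = upper_level()
    for pipeline, hp, (error, cost) in pareto_front:
        print(pipeline, round(hp, 3), round(error, 4), cost)
```

Each upper-level candidate is only scored after its lower-level search finishes, which is why the lower-level problem is the expensive part: in a real setting every call to the placeholder objective function would be a full train-and-validate run.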

Paper Structure

This paper contains 29 sections, 5 equations, 15 figures, 4 tables, 4 algorithms.

Figures (15)

  • Figure 1: The overall architecture of MBL-CPDP.
  • Figure 2: Violin plots and box plots of Scott-Knott test ranks achieved by six AutoML tools on AUC, ACC, Recall, F1, and MCC.
  • Figure 3: Total Scott-Knott test ranks achieved by six AutoML tools. A smaller sum of ranks in the bar chart indicates superior performance; the line graph shows the frequency of top ranks, with higher values denoting better performance.
  • Figure 4: Percentage of large, medium, small, and equal $A_{12}$ effect sizes when comparing MBL-CPDP with the other five AutoML tools on AUC, ACC, Recall, F1, and MCC.
  • Figure 5: The final solutions obtained by the six AutoML tools for three projects: EQ, LC, and Safe, respectively.
  • ...and 10 more figures
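
For readers unfamiliar with the $A_{12}$ statistic named in the Figure 4 caption, the sketch below computes it under the assumption that it denotes the Vargha-Delaney effect size: the probability that a value drawn from one sample exceeds a value drawn from the other, with ties counted as half. The function name and sample values are illustrative only, and the cut-offs the paper uses for "large", "medium", and "small" are not stated here, so none are applied.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: P(x > y) + 0.5 * P(x == y) over all pairs."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    wins = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (wins + 0.5 * ties) / (len(xs) * len(ys))


if __name__ == "__main__":
    # Made-up scores for two methods; not data from the paper.
    method_a = [0.81, 0.79, 0.84]
    method_b = [0.70, 0.79, 0.75]
    print(a12(method_a, method_b))
```

A value of 0.5 means neither sample tends to be larger; values closer to 1 (or 0) indicate that the first sample tends to be larger (or smaller) than the second.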

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4