Investigating Reproducibility in Deep Learning-Based Software Fault Prediction

Adil Mukhtar, Dietmar Jannach, Franz Wotawa

TL;DR

This study evaluates reproducibility in deep learning–based software fault prediction by systematically reviewing 56 articles from 2019–2022. It applies a four-category reproducibility framework (Source Code, Hyperparameter Tuning, Dataset, Evaluation) to quantify the artifacts provided by authors. About two thirds of the papers share code for their proposed models, but code for the compared baselines, for hyperparameter tuning, and for data preprocessing is missing in the vast majority of cases, raising concerns about exact reproducibility and fair comparisons. The work highlights the need for formal reproducibility guidelines, better review practices, and incentives to ensure reliable, verifiable progress in this subfield of software engineering.

Abstract

Over the past few years, deep learning methods have been applied to a wide range of Software Engineering (SE) tasks, in particular to the important task of automatically predicting and localizing faults in software. With the rapid adoption of increasingly complex machine learning models, however, it is becoming more and more difficult for scholars to reproduce the results reported in the literature. This is particularly the case when the applied deep learning models and the evaluation methodology are not properly documented and when code and data are not shared. Given some recent and very worrying findings regarding reproducibility and progress in other areas of applied machine learning, the goal of this work is to analyze to what extent the field of software engineering, in particular the area of software fault prediction, is plagued by similar problems. We have therefore conducted a systematic review of the current literature and examined the level of reproducibility of 56 research articles that were published between 2019 and 2022 in top-tier software engineering conferences. Our analysis revealed that scholars are apparently largely aware of the reproducibility problem, and about two thirds of the papers provide code for their proposed deep learning models. However, in the vast majority of cases, crucial elements for reproducibility are missing, such as the code of the compared baselines, code for data pre-processing, or code for hyperparameter tuning. In these cases, it therefore remains challenging to exactly reproduce the results in the current research literature. Overall, our meta-analysis calls for improved research practices to ensure the reproducibility of machine learning-based research.

Paper Structure (25 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Categorization of Software Fault Prediction Techniques, adapted from WongGLAW2016
  • Figure 2: Articles selection process
  • Figure 3: Frequencies of different deep learning architectures
  • Figure 4: Analysis of Source Code Reproducibility Variables
  • Figure 5: Analysis of Hyperparameter Tuning Reproducibility Variables
  • ...and 2 more figures