When Data Quality Issues Collide: A Large-Scale Empirical Study of Co-Occurring Data Quality Issues in Software Defect Prediction

Emmanuel Charleson Dapaah, Jens Grabowski

TL;DR

Software Defect Prediction (SDP) performance is constrained by multiple co-occurring data quality issues. The authors conduct a large-scale empirical study across 374 SDP datasets and five classifiers, using Explainable Boosting Machines and stratified interaction analysis to quantify the direct and conditional effects of five data quality issues. They find that co-occurrence is nearly universal and identify stable performance-degradation thresholds (class overlap near 0.20, imbalance near 0.65-0.70, and irrelevance near 0.94), along with counterintuitive patterns such as outliers sometimes improving performance. The findings emphasize data-aware model selection and multi-faceted preprocessing, and the work provides a replicable framework and thresholds to guide practice and future research in SDP and broader data-centric ML.

Abstract

Software Defect Prediction (SDP) models are central to proactive software quality assurance, yet their effectiveness is often constrained by the quality of available datasets. Prior research has typically examined single issues such as class imbalance or feature irrelevance in isolation, overlooking that real-world data problems frequently co-occur and interact. This study presents, to our knowledge, the first large-scale empirical analysis in SDP that simultaneously examines five co-occurring data quality issues (class imbalance, class overlap, irrelevant features, attribute noise, and outliers) across 374 datasets and five classifiers. We employ Explainable Boosting Machines together with stratified interaction analysis to quantify both direct and conditional effects under default hyperparameter settings, reflecting practical baseline usage. Our results show that co-occurrence is nearly universal: even the least frequent issue (attribute noise) appears alongside others in more than 93% of datasets. Irrelevant features and imbalance are nearly ubiquitous, while class overlap is the most consistently harmful issue. We identify stable tipping points around 0.20 for class overlap, 0.65-0.70 for imbalance, and 0.94 for irrelevance, beyond which most models begin to degrade. We also uncover counterintuitive patterns, such as outliers improving performance when irrelevant features are low, underscoring the importance of context-aware evaluation. Finally, we expose a performance-robustness trade-off: no single learner dominates under all conditions. By jointly analyzing prevalence, co-occurrence, thresholds, and conditional effects, our study directly addresses a persistent gap in SDP research, moving beyond isolated analyses to provide a holistic, data-aware understanding of how quality issues shape model performance in real-world settings.
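
As an illustration of how the reported tipping points might be used in practice, the following minimal sketch flags which issues in a dataset lie beyond the thresholds quoted in the abstract. It is not the authors' tooling: the dictionary keys and the function name are hypothetical, the severity scores are assumed to be precomputed on the same scales the paper uses (this section does not define them), and the lower end of the 0.65-0.70 imbalance range is chosen here as a conservative cut-off.

```python
from typing import Dict, List

# Tipping points quoted in the abstract. The key names are hypothetical,
# and 0.65 is an assumed (conservative) pick from the reported 0.65-0.70 range.
TIPPING_POINTS: Dict[str, float] = {
    "class_overlap": 0.20,
    "class_imbalance": 0.65,
    "irrelevant_features": 0.94,
}


def issues_past_tipping_point(severity: Dict[str, float]) -> List[str]:
    """Return the names of issues whose severity exceeds its tipping point.

    `severity` maps issue names to precomputed scores; how those scores are
    measured is outside this sketch. Issues missing from `severity` are
    treated as absent (severity 0.0).
    """
    return sorted(
        name
        for name, threshold in TIPPING_POINTS.items()
        if severity.get(name, 0.0) > threshold
    )


if __name__ == "__main__":
    example = {
        "class_overlap": 0.25,
        "class_imbalance": 0.50,
        "irrelevant_features": 0.95,
    }
    print(issues_past_tipping_point(example))
```

Because the abstract stresses that issues interact (for example, outliers helping when irrelevance is low), a per-issue check like this is only a first screen and says nothing about conditional effects.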

Paper Structure

This paper contains 44 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Box plots illustrating the distribution of data quality issues
  • Figure 2: Heatmap illustrating the pairwise co-occurrence of data quality issues
  • Figure 3: Heatmap illustrating the influence scores of data quality issues across different models. Darker colors indicate higher influence
  • Figure 4: Monotonic relationship between each data quality issue and model performance. The x-axis shows the severity of the issue, and the y-axis represents its additive contribution to Balanced Accuracy relative to the EBM intercept
  • Figure 5: Stratified Interaction Analysis with Boxplots