
When Less is More: On the Value of "Co-training" for Semi-Supervised Software Defect Predictors

Suvodeep Majumder, Joymallya Chakraborty, Tim Menzies

TL;DR

This work investigates the effectiveness of semi-supervised learning (SSL) for defect prediction in software engineering by applying 55 SSL methods to a large and diverse dataset of 714 GitHub projects. It finds that co-training-based SSL, even with as little as 2.5% of labeled data, can match or exceed fully supervised models while dramatically reducing labeling effort (approximately 40-fold). The study also shows that a mutual-teaching strategy improves recall over self-teaching, and that single-view co-training suffices for SE defect prediction, with multi-view offering no additional predictive benefit and higher running times. Overall, the results provide practical guidance for deploying SSL in software analytics and suggest substantial cost savings for defect prediction workflows, while highlighting areas for future research and broader applicability. The accompanying code and data enable replication and extension of the findings.

Abstract

Labeling a module as defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models. However, there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in SE for (e.g.) predicting defects, and even there, those methods have been tested on just a handful of projects. This paper applies a wide range of 55 semi-supervised learners to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. Specifically, after labeling just 2.5% of the data, they make predictions that are competitive with those using 100% of the data. That said, co-training needs to be used cautiously, since the specific co-training method must be carefully selected based on a user's specific goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions while adding considerably to the run-time cost (11 hours vs. 1.8 hours). It is an open question, worthy of future work, to test if these reductions can be seen in other areas of software analytics. To assist with exploring other areas, all the code used is available at https://github.com/ai-se/Semi-Supervised.
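
To make the idea of single-view co-training with mutual teaching concrete, the following is a minimal sketch, not the authors' implementation and not one of the 55 learners studied. It assumes two toy nearest-centroid learners that see the same columns (a single view) but use different distance functions, and a synthetic two-class dataset standing in for defect metrics. All function names (`fit_centroids`, `co_train`, `predict_joint`, `make_blobs`) and all parameter values are hypothetical. In each round, every learner pseudo-labels the unlabeled rows it is most confident about and hands them to the other learner.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid model: one mean vector per class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def squared_l2(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def predict(model, row, dist):
    """Return (confidence, label). Confidence is the relative margin between
    the two nearest centroids, so the model must hold at least two classes."""
    ranked = sorted((dist(row, centre), label) for label, centre in model.items())
    (d1, label), (d2, _) = ranked[0], ranked[1]
    return (d2 - d1) / (d1 + d2 + 1e-12), label

def co_train(X_lab, y_lab, X_unlab, rounds=10, per_round=5):
    """Single-view co-training: both learners see all columns."""
    dists = [squared_l2, l1]  # assumption: learners differ only in distance
    train = [(list(X_lab), list(y_lab)) for _ in dists]
    pool = list(X_unlab)
    for _ in range(rounds):
        models = [fit_centroids(*t) for t in train]
        for k in (0, 1):
            # Learner k ranks the remaining pool by its own confidence.
            scored = sorted(((predict(models[k], row, dists[k]), i)
                             for i, row in enumerate(pool)), reverse=True)
            picked = scored[:per_round]
            # Mutual teaching: k's pseudo-labels extend the OTHER learner's data.
            for (_, label), i in picked:
                train[1 - k][0].append(pool[i])
                train[1 - k][1].append(label)
            taken = {i for _, i in picked}
            pool = [row for i, row in enumerate(pool) if i not in taken]
    return [fit_centroids(*t) for t in train], dists

def predict_joint(models, dists, row):
    """Label from whichever learner is more confident on this row."""
    return max(predict(m, row, d) for m, d in zip(models, dists))[1]

def make_blobs(n, rng):
    """Synthetic two-class data (a stand-in, not real defect data)."""
    X, y = [], []
    for label, centre in ((0, 0.0), (1, 4.0)):
        for _ in range(n):
            X.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            y.append(label)
    return X, y

rng = random.Random(0)
X, y = make_blobs(200, rng)                      # 400 training rows
X_test, y_test = make_blobs(100, rng)            # 200 held-out rows
lab = list(range(0, 5)) + list(range(200, 205))  # 10 of 400 rows: 2.5% labeled
X_lab = [X[i] for i in lab]
y_lab = [y[i] for i in lab]
X_unlab = [X[i] for i in range(len(X)) if i not in lab]

models, dists = co_train(X_lab, y_lab, X_unlab)
hits = sum(predict_joint(models, dists, r) == t for r, t in zip(X_test, y_test))
accuracy = hits / len(y_test)
print(f"held-out accuracy with 2.5% labels: {accuracy:.2f}")
```

Labeling 10 of 400 rows corresponds to the 2.5% budget mentioned above, which is where the roughly 40-fold reduction in labeling effort comes from (100 / 2.5 = 40). A multi-view variant would instead give each learner a different subset of columns. The sketch keeps one view because the abstract reports no predictive benefit from splitting columns.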

Paper Structure (28 sections, 8 figures, 5 tables)


Figures (8)

  • Figure 1: Taxonomy of semi-supervised learning from Van et al. [van2020survey]. This paper explores 55 methods from the pink nodes. The blue nodes are left for future work since those nodes use methods that are either (a) very computationally expensive or (b) developed for data types not relevant to our target (defect prediction).
  • Figure 2: Framework
  • Figure 3: Examples of Scott-Knott results. In this figure, treatments with the same rank are assigned the same color.
  • Figure 4: The Recall results are divided into five statistically different groups. Note that larger values are better for recall. The fully supervised methods (that use labels on 100% of the data) are denoted DT, RF, LR, KNN, and SVM. Note that this group does not perform better or worse than the SSL methods (that use 2.5% of the data).
  • Figure 6: The false alarm results are divided into five statistically different groups. Note that smaller values are better for false alarm. Once again, we note that the fully supervised methods DT, RF, LR, KNN, and SVM do not perform markedly better or worse than the SSL methods.
  • ...and 3 more figures