Assessing Data Augmentation-Induced Bias in Training and Testing of Machine Learning Models

Riddhi More, Jeremy S. Bradbury

TL;DR

This paper examines data augmentation-induced bias in software engineering ML tasks, focusing on cases where augmented samples appear in both the training and testing sets. It presents a case study on flaky-test classification using a CodeBERT-based Siamese model with a SMOTE augmentation adapted for FlakyCat, and runs two experiments to separate the benefits of augmentation from the bias it introduces. The findings show that augmentation can significantly improve average performance but also introduces systematic bias: the average $F1$ score differs by about 8% between augmented and independent test sets, and the effect varies notably by category. The work provides practical guidelines for evaluation, such as maintaining a separate validation set of original samples and adopting category-specific augmentation strategies, with implications for SE tasks beyond flaky tests.

Abstract

Data augmentation has become a standard practice in software engineering to address limited or imbalanced data sets, particularly in specialized domains like test classification and bug detection where data can be scarce. Although techniques such as SMOTE and mutation-based augmentation are widely used in software testing and debugging applications, a rigorous understanding of how augmented training data impacts model bias is lacking. It is especially critical to consider bias in scenarios where augmented data sets are used not just in training but also in testing models. Through a comprehensive case study of flaky test classification, we demonstrate how to test for bias and understand the impact that the inclusion of augmented samples in testing sets can have on model evaluation.
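
To make the evaluation concern concrete, the following is a minimal sketch of holding out original samples before any augmentation, so that synthetic points never reach the held-out set. It is not the paper's procedure: the function name `split_then_augment`, its parameters, and the plain numeric feature vectors are all assumptions for illustration, and the interpolation is a simplified SMOTE-style step that pairs random same-class samples instead of nearest neighbors.

```python
import random


def split_then_augment(samples, labels, test_fraction=0.2, n_synthetic=1, seed=0):
    """Hold out original samples first, then augment the training split only.

    Hypothetical helper for illustration. Returns (train, held_out, synthetic),
    where each element is a list of (feature_vector, label) pairs and `train`
    already includes the synthetic pairs.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)

    # Split BEFORE augmenting so the held-out set contains only original data.
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    held_out = [(samples[i], labels[i]) for i in test_idx]
    train = [(samples[i], labels[i]) for i in train_idx]

    # Group the training vectors by class label.
    by_label = {}
    for vec, lab in train:
        by_label.setdefault(lab, []).append(vec)

    # Simplified SMOTE-style step (assumption): interpolate between two random
    # same-class training samples. Real SMOTE interpolates toward one of the
    # k nearest neighbors instead.
    synthetic = []
    for lab, vecs in by_label.items():
        if len(vecs) < 2:
            continue  # cannot interpolate with a single sample
        for _ in range(n_synthetic):
            a, b = rng.sample(vecs, 2)
            t = rng.random()
            point = [x + t * (y - x) for x, y in zip(a, b)]
            synthetic.append((point, lab))

    return train + synthetic, held_out, synthetic
```

Scoring a model on `held_out` and, separately, on a test set that mixes in synthetic samples gives two numbers whose gap is one rough indicator of the kind of bias the abstract describes.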

Paper Structure

This paper contains 17 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: F1 score by category Phase A vs Phase B from Experiment 1
  • Figure 2: F1 scores by category of testing set 1 vs testing set 2 from Experiment 2