Metric-DST: Mitigating Selection Bias Through Diversity-Guided Semi-Supervised Metric Learning

Yasin I. Tepeli, Mathijs de Wolf, Joana P. Gonçalves

TL;DR

This work addresses selection bias in machine learning by introducing Metric-DST, a diversity-guided self-training framework built on metric learning to create a diverse, class-aware embedding space for pseudo-labeling unlabeled data. By sampling diverse regions of the embedding space rather than maximizing confidence, Metric-DST mitigates confirmation bias and improves generalization under biased training data. Across generated, real-world, and synthetic lethality datasets with varied bias scenarios, Metric-DST often matches or surpasses supervised performance and generally outperforms conventional self-training, especially when labeled data are scarce. The approach is flexible, classifier-agnostic, and broadly applicable for fairness-aware predictions in the presence of selection bias.

Abstract

Selection bias poses a critical challenge for fairness in machine learning, as models trained on data that is less representative of the population might exhibit undesirable behavior for underrepresented profiles. Semi-supervised learning strategies like self-training can mitigate selection bias by incorporating unlabeled data into model training to gain further insight into the distribution of the population. However, conventional self-training seeks to include high-confidence data samples, which may reinforce existing model bias and compromise effectiveness. We propose Metric-DST, a diversity-guided self-training strategy that leverages metric learning and its implicit embedding space to counter confidence-based bias through the inclusion of more diverse samples. Metric-DST learned more robust models in the presence of selection bias for generated and real-world datasets with induced bias, as well as a molecular biology prediction task with intrinsic bias. The Metric-DST learning strategy offers a flexible and widely applicable solution to mitigate selection bias and enhance fairness of machine learning models.

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures.

Figures (4)

  • Figure 1: Overview of the Metric-DST methodology. A Metric-DST iteration encompasses 1) training a metric learning model on labeled data that can be used to transform both labeled and unlabeled samples into an embedding space, 2) obtaining predicted pseudo-labels and model confidence values for unlabeled samples using k-nearest neighbors (kNN) on the embedding space representations, 3) selecting diverse pseudo-labeled samples distributed across the learned embedding space and adding them to the labeled set for the subsequent iteration.
  • Figure 2: Mitigation of selection bias induced in generated and real-world benchmark data. (a) Samples selected by delta bias ($\Delta_0=\Delta_1=(0,0)$ for classes $0$ and $1$) highlighted on a scatter plot of the artificially generated 2D moons dataset. Performance (AUROC) of supervised and semi-supervised Metric-(D)ST methods using metric learning and kNN on: (b) the generated 2D moons dataset of 2000 samples with four delta bias induction settings, selecting 100 or 200 samples with $\Delta_0=\Delta_1=(0,0)$ and $\{\Delta_0=(1,0.5), \Delta_1=(0,0)\}$, (c) generated higher-dimensional datasets of 2000 samples and 16, 32, 64, and 128 features with hierarchy bias induction (ratio $b=0.9$), selecting 100 or 200 samples, and (d) eight real-world datasets with hierarchy bias induction ($b=0.9$), targeting the selection of 60 and 100 samples. Results are from 10-fold cross-validation, with all methods evaluated using the same folds (train/test splits) and the same divisions of the train sets into labeled and unlabeled subsets. Methods included: a supervised model trained on the complete labeled set (No Bias), on a biased selection (Bias), or on randomly selected samples (Random, same number as the biased selection); and semi-supervised models using conventional self-training (Metric-ST) or diversity-guided self-training (Metric-DST) on the biased labeled train set plus the unlabeled train set. Red asterisks denote a significant difference in performance (p-value < 0.05) between the method marked with an asterisk and the biased supervised method, based on a two-sided Wilcoxon signed-rank test.
  • Figure 3: Mitigation of intrinsic selection bias for synthetic lethality prediction. Prediction performance (AUPRC) of synthetic lethality prediction models trained and tested per cancer type using supervised learning or the semi-supervised Metric-ST and Metric-DST methods for 10 train/test splits. Three types of splits were used to control the degree of similarity in selection bias between the train and test sets: (a) Randomized split, (b) Double holdout, (c) Cross dataset. For (a), (b), and (c), boxplots include all points (no outlier detection), and the white circles denote the mean values. (d) Average Euclidean distances between pseudo-labeled samples selected by Metric-ST and Metric-DST per class, with diamonds denoting outliers. The red asterisks denote significant differences in performance (p-value < 0.05) between the method with an asterisk and the biased supervised method based on a two-sided Wilcoxon signed-rank test.
  • Figure 4: UMAP projections of the SL dataset for BRCA. On the left, the training samples are highlighted before training. The plots on the right show the pseudo-labeled samples selected during training by Metric-ST (top right) and by Metric-DST. The number of samples of each class is stated in parentheses. The box marks a cluster dominated by gene pairs containing the gene CDH1.
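
The three steps of a Metric-DST iteration described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: `embed` is a hypothetical stand-in for the trained metric learning model (here simply the identity map), the confidence is taken to be the kNN vote share of the winning class, `min_conf` is an assumed threshold, and the diversity step uses greedy farthest-point sampling as one plausible way to spread selections across the embedding space. All function and parameter names are illustrative.

```python
import numpy as np


def knn_pseudo_labels(emb_labeled, y_labeled, emb_unlabeled, k=5):
    """Step 2: binary kNN vote in the embedding space.

    Confidence is assumed to be the vote share of the winning class.
    """
    # Pairwise Euclidean distances, shape (n_unlabeled, n_labeled).
    diff = emb_unlabeled[:, None, :] - emb_labeled[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    p1 = y_labeled[nearest].mean(axis=1)  # share of class-1 neighbors
    pseudo = (p1 >= 0.5).astype(int)
    conf = np.where(pseudo == 1, p1, 1.0 - p1)
    return pseudo, conf


def select_diverse(emb, candidates, n_select):
    """Step 3 (assumed strategy): greedy farthest-point sampling."""
    candidates = np.asarray(candidates, dtype=int)
    if len(candidates) == 0 or n_select <= 0:
        return np.array([], dtype=int)
    pts = emb[candidates]
    chosen = [0]  # start from the first candidate
    min_dist = np.linalg.norm(pts - pts[0], axis=1)
    while len(chosen) < min(n_select, len(candidates)):
        nxt = int(np.argmax(min_dist))
        if min_dist[nxt] == 0.0:  # only duplicates of chosen points remain
            break
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return candidates[chosen]


def metric_dst_iteration(X_l, y_l, X_u, embed, k=5, n_select=4, min_conf=0.6):
    """One iteration: embed, pseudo-label with kNN, add diverse samples."""
    # Step 1 would (re)train the metric model on (X_l, y_l); here `embed`
    # is a placeholder for that trained model.
    z_l, z_u = embed(X_l), embed(X_u)
    pseudo, conf = knn_pseudo_labels(z_l, y_l, z_u, k=k)
    candidates = np.flatnonzero(conf >= min_conf)
    picked = select_diverse(z_u, candidates, n_select)
    X_l_new = np.vstack([X_l, X_u[picked]])
    y_l_new = np.concatenate([y_l, pseudo[picked]])
    keep = np.setdiff1d(np.arange(len(X_u)), picked)
    return X_l_new, y_l_new, X_u[keep], picked


# Toy usage: two Gaussian blobs, 10 labeled samples per class, 40 unlabeled.
rng = np.random.default_rng(0)
X_l = np.vstack([rng.normal(-3, 1, (10, 2)), rng.normal(3, 1, (10, 2))])
y_l = np.array([0] * 10 + [1] * 10)
X_u = np.vstack([rng.normal(-3, 1, (20, 2)), rng.normal(3, 1, (20, 2))])

X_l_new, y_l_new, X_u_new, picked = metric_dst_iteration(
    X_l, y_l, X_u, embed=lambda x: x
)
```

In a full loop, the returned labeled and unlabeled sets would feed the next iteration, with the metric model retrained each time. Conventional confidence-based self-training (Metric-ST in the captions) would correspond to replacing `select_diverse` with a top-confidence selection.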