Active Data Sampling and Generation for Bias Remediation

Antonio Maratea, Rita Perna

TL;DR

This paper tackles bias in AI arising from non-probabilistic sampling and proposes samplation, a mixed active sampling and data generation strategy that creates discriminant-value reservoirs via data augmentation (e.g., SMOTE) and introduces a small, reversely biased synthetic sample during fine-tuning to improve fairness. Applied to a visual semantic role labeling task on the imSitu dataset, samplation demonstrates that a tiny fraction of artificial data can fully cure a substantial initial bias (e.g., $90/10$) with only a modest impact on accuracy. The approach blends ideas from oversampling, active sampling, and reservoir sampling to address bias at the data level during model refinement, offering a cost-effective path toward fairer large pretrained models. The findings have practical implications for deployments where data collection is non-random or constrained, and they highlight the balance between fairness gains and potential accuracy trade-offs.

Abstract

Adequate coverage of the sampling space is the keystone of training trustworthy Machine Learning models. Unfortunately, real data carry inherent risks from the many potential biases they exhibit when gathered without proper random sampling over the reference population, and such sampling is usually too expensive or time-consuming to be a viable option. Depending on how training data have been gathered, unmitigated biases can lead to harmful or discriminatory consequences that ultimately hinder the large-scale applicability of pre-trained models and undermine expectations of their truthfulness or fairness. In this paper, a mixed active sampling and data generation strategy, called samplation, is proposed as a means to compensate, during fine-tuning of a pre-trained classifier, for the unfair classifications it produces, assuming that the training data come from a non-probabilistic sampling scheme. Given a pre-trained classifier, first a fairness metric is evaluated on a test set, then new reservoirs of labeled data are generated, and finally a number of reversely biased artificial samples are generated for fine-tuning the model. Using Deep Models for visual semantic role labeling as a case study, the proposed method was able to fully cure a simulated gender bias starting from a 90/10 imbalance, with only a small percentage of new data and a minor effect on accuracy.
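The three steps named in the abstract (measure a fairness metric, build reservoirs, draw a reversely biased sample) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names `imbalance_ratio` and `reversed_bias_sample`, the group labels, and in particular the rule that the fine-tuning sample simply mirrors the observed imbalance. The reservoirs are assumed to already exist as lists of labeled items per group.

```python
import random
from collections import Counter


def imbalance_ratio(predictions, majority):
    """Fraction of predictions assigned to the majority group (0.5 = balanced).

    Hypothetical stand-in for the fairness metric evaluated on a test set.
    """
    counts = Counter(predictions)
    total = sum(counts.values())
    return counts[majority] / total if total else 0.0


def reversed_bias_sample(reservoirs, observed_ratio, majority, minority, size, rng):
    """Draw `size` items whose group mix mirrors the observed imbalance.

    Assumption: if the model favours the majority group with share
    `observed_ratio`, the fine-tuning sample gives that share to the
    minority group instead. The paper may use a different rule.
    """
    n_minority = round(size * observed_ratio)
    n_majority = size - n_minority
    # Sampling with replacement, so small reservoirs are still usable.
    sample = rng.choices(reservoirs[minority], k=n_minority)
    sample += rng.choices(reservoirs[majority], k=n_majority)
    rng.shuffle(sample)
    return sample


if __name__ == "__main__":
    rng = random.Random(0)
    # Step 1: measure the bias of the (simulated) model predictions.
    predictions = ["male"] * 90 + ["female"] * 10
    ratio = imbalance_ratio(predictions, majority="male")
    # Step 2: reservoirs of labeled items per group (placeholders here).
    reservoirs = {
        "male": [f"male_{i}" for i in range(50)],
        "female": [f"female_{i}" for i in range(50)],
    }
    # Step 3: a small, reversely biased sample for fine-tuning.
    sample = reversed_bias_sample(reservoirs, ratio, "male", "female", 20, rng)
    print(ratio, Counter(item.split("_")[0] for item in sample))
```

In practice the reservoirs would hold augmented training examples rather than strings, and the sample size would be tuned against the accuracy cost that the abstract mentions.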

Paper Structure

This paper contains 15 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Example of Visual Semantic Role Labeling. Image source: Yatskar et al. (2016).
  • Figure 2: Test-time results on the model predictions of the proposed method when the initial imbalance is 90%-10%. $X$ axis, sample size; $Y$ axis, imbalance ratio. Bold line represents the average.
  • Figure 3: Test-time results on the model predictions of the proposed method when the initial imbalance is 80%-20%. $X$ axis, sample size; $Y$ axis, imbalance ratio. Bold line represents the average.
  • Figure 4: Test-time results on the model predictions of the proposed method when the initial imbalance is 70%-30%. $X$ axis, sample size; $Y$ axis, imbalance ratio. Bold line represents the average.
  • Figure 5: Test-time results on the model predictions of the proposed method when the initial imbalance is 60%-40%. $X$ axis, sample size; $Y$ axis, imbalance ratio. Bold line represents the average.