Active Data Sampling and Generation for Bias Remediation
Antonio Maratea, Rita Perna
TL;DR
This paper tackles bias in AI arising from non-probabilistic sampling and proposes samplation, a non-probabilistic strategy that creates discriminant-value reserves via data augmentation (e.g., SMOTE) and introduces a small, reversely biased synthetic sample during fine-tuning to improve fairness. Applied to a visual semantic role labeling task on the imSitu dataset, samplation demonstrates that a tiny fraction of artificial data can fully cure a substantial initial bias (e.g., $90/10$) with only a modest impact on accuracy. The approach blends ideas from oversampling, active sampling, and reservoir sampling to address bias at the data level during model refinement, offering a cost-effective path toward fairer large pretrained models. The findings suggest practical implications for deployment where data collection is non-random or constrained, highlighting the balance between fairness gains and potential accuracy trade-offs.
Abstract
Adequate coverage of the sampling space is the keystone of training trustworthy Machine Learning models. Unfortunately, real data carry several inherent risks due to the many potential biases they exhibit when gathered without proper random sampling over the reference population, and most of the time such sampling is too expensive or time-consuming to be a viable option. Depending on how the training data have been gathered, unmitigated biases can lead to harmful or discriminatory consequences that ultimately hinder the large-scale applicability of pre-trained models and undermine expectations of their truthfulness or fairness. In this paper, a mixed active sampling and data generation strategy, called samplation, is proposed as a means to compensate, during fine-tuning of a pre-trained classifier, for the unfair classifications it produces, assuming that the training data come from a non-probabilistic sampling scheme. Given a pre-trained classifier, first a fairness metric is evaluated on a test set, then new reservoirs of labeled data are generated, and finally a number of reversely biased artificial samples are drawn for the fine-tuning of the model. Using Deep Models for visual semantic role labeling as a case study, the proposed method was able to fully cure a simulated gender bias starting from a 90/10 imbalance, with only a small percentage of new data and with a minor effect on accuracy.
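
The three steps named in the abstract (evaluate a fairness metric, build reservoirs, draw a reversely biased sample) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parity-gap metric, and the interpolation between random pairs (a simplification of SMOTE, which interpolates toward nearest neighbours) are all assumptions made for illustration, and feature vectors stand in for the image data used in the paper.

```python
import numpy as np


def parity_gap(y_pred, groups):
    """Assumed fairness metric: gap in positive-prediction rate between groups 0 and 1."""
    y_pred, groups = np.asarray(y_pred), np.asarray(groups)
    return float(abs(y_pred[groups == 0].mean() - y_pred[groups == 1].mean()))


def interpolate_reservoir(X, n_new, rng):
    """Build a reservoir of synthetic points on segments between random pairs of X.

    Simplified SMOTE-like augmentation (hypothetical helper, not a library call).
    """
    X = np.asarray(X, dtype=float)
    i = rng.integers(0, len(X), size=n_new)
    j = rng.integers(0, len(X), size=n_new)
    lam = rng.random((n_new, 1))  # interpolation weights in [0, 1)
    return X[i] + lam * (X[j] - X[i])


def reversely_biased_sample(res_major, res_minor, n, observed_minor_ratio, rng):
    """Draw n reservoir points whose group proportions mirror the observed imbalance.

    If the minority covers 10% of the training data, it covers 90% of this sample.
    """
    n_minor = int(round(n * (1.0 - observed_minor_ratio)))
    n_major = n - n_minor
    pick_major = res_major[rng.choice(len(res_major), size=n_major, replace=True)]
    pick_minor = res_minor[rng.choice(len(res_minor), size=n_minor, replace=True)]
    X = np.vstack([pick_major, pick_minor])
    g = np.concatenate([np.zeros(n_major, dtype=int), np.ones(n_minor, dtype=int)])
    return X, g


rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(90, 4))  # toy features, majority group (90%)
X_minor = rng.normal(1.0, 1.0, size=(10, 4))  # toy features, minority group (10%)

res_major = interpolate_reservoir(X_major, 200, rng)
res_minor = interpolate_reservoir(X_minor, 200, rng)
X_ft, g_ft = reversely_biased_sample(res_major, res_minor, 20, 0.10, rng)
```

Here `X_ft` and `g_ft` would be the small artificial set added during fine-tuning; how many such samples are needed, and which fairness metric to monitor, are choices the sketch leaves open.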
