Feature-Weighted Maximum Representative Subsampling

Tony Hauptmann, Stefan Kramer

TL;DR

Feature-weighted MRS (FW-MRS) uses feature weights to reduce the influence of highly biased features on the computation of sample weights, which allows it to retain more instances for downstream tasks.

Abstract

In the social sciences, it is often necessary to debias studies and surveys before valid conclusions can be drawn. Debiasing algorithms enable the computational removal of bias using sample weights. However, an issue arises when only a subset of features is highly biased, while the rest are already representative. In that case, algorithms must strongly alter the sample distribution to handle the few highly biased features, which can in turn introduce bias into the already representative features. To address this issue, we developed a method that uses feature weights to minimize the impact of highly biased features on the computation of sample weights. Our algorithm is based on Maximum Representative Subsampling (MRS), which debiases datasets by aligning a non-representative sample with a representative one through iterative removal of elements to create a representative subsample. The new algorithm, named feature-weighted MRS (FW-MRS), decreases the emphasis on highly biased features, allowing it to retain more instances for downstream tasks. The feature weights are derived from the feature importance of a domain classifier trained to differentiate between the representative and non-representative datasets. We validated FW-MRS using eight tabular datasets, each of which we artificially biased. Biased features can be important for downstream tasks, and focusing less on them could lead to a decline in generalization. For this reason, we assessed the generalization performance of FW-MRS on downstream tasks and found no statistically significant differences. Additionally, FW-MRS was applied to a real-world dataset from the social sciences. The source code is available at https://github.com/kramerlab/FeatureWeightDebiasing.
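The idea of down-weighting highly biased features can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that) and rests on two assumptions that are not taken from the paper. First, instead of the feature importance of a trained domain classifier, each feature gets a hypothetical "bias score" based on how well that feature alone separates the two samples (single-feature AUROC). Second, scores are turned into weights with a softmax over the negated scores, scaled by a `temperature` parameter. The function names `single_feature_auroc`, `bias_scores`, and `feature_weights` and the toy data are illustrative only.

```python
import math
import random


def single_feature_auroc(rep_values, biased_values):
    """AUROC for telling biased from representative rows using one feature.

    Computed as the fraction of (biased, representative) pairs in which the
    biased value is larger; ties count as half a win.
    """
    wins = 0.0
    for b in biased_values:
        for r in rep_values:
            if b > r:
                wins += 1.0
            elif b == r:
                wins += 0.5
    return wins / (len(biased_values) * len(rep_values))


def bias_scores(representative, biased):
    """Per-feature separability in [0, 1]; 0 means indistinguishable.

    Assumption: this stands in for the feature importance of a domain classifier.
    """
    n_features = len(representative[0])
    scores = []
    for j in range(n_features):
        rep_col = [row[j] for row in representative]
        biased_col = [row[j] for row in biased]
        auroc = single_feature_auroc(rep_col, biased_col)
        scores.append(2.0 * abs(auroc - 0.5))
    return scores


def feature_weights(scores, temperature=1.0):
    """Softmax over negated scores, so a higher bias score gives a lower weight.

    Assumption: the softmax form and the role of `temperature` are illustrative.
    """
    logits = [-s / temperature for s in scores]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data (hypothetical): three features, only feature 0 is shifted in the biased sample.
rng = random.Random(0)
representative = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
biased = [
    [rng.gauss(2.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    for _ in range(200)
]

scores = bias_scores(representative, biased)
weights = feature_weights(scores, temperature=0.5)
print("bias scores:", [round(s, 3) for s in scores])
print("feature weights:", [round(w, 3) for w in weights])
```

In this sketch, a lower `temperature` sharpens the contrast between weights, while a higher one pushes them toward uniform. The shifted feature receives the smallest weight, so it would contribute least to any weighted comparison of the two samples.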

Paper Structure (14 sections, 3 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Schema for feature-weighted maximum representative subsampling (FW-MRS). Two surveys stem from the same population: one is biased and includes the variable under investigation, while the other is representative but does not include it. FW-MRS mitigates bias by comparing the distributions of the biased and representative datasets using a classifier that leverages auxiliary information from the representative dataset to remove samples from the biased dataset. The algorithm returns a representative subset and feature weights that align the non-representative study with the distribution of the representative one.
  • Figure 2: Validation AUROC vs. number of dropped samples: All hyperparameters were fixed except for the temperature, which was varied. Each point denotes the mean AUROC across runs, with ellipses indicating the standard deviation. Circles correspond to FW-MRS$_{RF}$ and the triangle represents MRS.
  • Figure 3: Relative number of dropped samples and AUROC over 50 iterations with 10-times repeated 5-fold cross-validation for FW-MRS$_{RF}$. Subfigure a) shows the relative number of dropped samples, and b) the corresponding AUROC. Hyperparameters were optimized on $N$ for the downstream task, and the effect of different temperature settings is compared.
  • Figure 4: Relative number of dropped samples over 50 iterations with 10-times repeated 5-fold cross-validation.
  • Figure 5: Feature weights used in FW-MRS$_{RF}$ debiasing of GBS with auxiliary information of Allensbach.
  • ...and 2 more figures