Feature-Weighted Maximum Representative Subsampling

Tony Hauptmann; Stefan Kramer

TL;DR

Feature-weighted MRS (FW-MRS) uses feature weights to reduce the influence of highly biased features on the computation of sample weights, which allows it to retain more instances for downstream tasks.

Abstract

In the social sciences, it is often necessary to debias studies and surveys before valid conclusions can be drawn. Debiasing algorithms enable the computational removal of bias using sample weights. However, an issue arises when only a subset of features is highly biased, while the rest is already representative. Algorithms need to strongly alter the sample distribution to manage a few highly biased features, which can in turn introduce bias into already representative variables. To address this issue, we developed a method that uses feature weights to minimize the impact of highly biased features on the computation of sample weights. Our algorithm is based on Maximum Representative Subsampling (MRS), which debiases datasets by aligning a non-representative sample with a representative one through iterative removal of elements to create a representative subsample. The new algorithm, named feature-weighted MRS (FW-MRS), decreases the emphasis on highly biased features, allowing it to retain more instances for downstream tasks. The feature weights are derived from the feature importance of a domain classifier trained to differentiate between the representative and non-representative datasets. We validated FW-MRS using eight tabular datasets, each of which we artificially biased. Biased features can be important for downstream tasks, and focusing less on them could lead to a decline in generalization. For this reason, we assessed the generalization performance of FW-MRS on downstream tasks and found no statistically significant differences. Additionally, FW-MRS was applied to a real-world dataset from the social sciences. The source code is available at https://github.com/kramerlab/FeatureWeightDebiasing.
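
The abstract describes feature weights derived from the feature importance of a domain classifier, with highly biased features receiving less emphasis. This section does not give the exact mapping, so the following is only a minimal sketch under an assumption: weights are taken as a softmax over negated importances, scaled by a temperature (the figure captions mention a temperature hyperparameter, but its precise role here is assumed). The function name `feature_weights` and the example importances are hypothetical.

```python
import math


def feature_weights(importances, temperature=1.0):
    """Map domain-classifier feature importances to feature weights.

    Assumption (not taken from the paper): weights are a softmax over the
    negated importances, so the features the domain classifier relies on
    most (the most biased ones) receive the smallest weight.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [-imp / temperature for imp in importances]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical importances: the first feature is the highly biased one.
importances = [0.70, 0.20, 0.10]
sharp = feature_weights(importances, temperature=0.1)
flat = feature_weights(importances, temperature=10.0)
print("low temperature: ", [round(w, 3) for w in sharp])
print("high temperature:", [round(w, 3) for w in flat])
```

In this sketch a low temperature concentrates weight on the least biased feature, while a high temperature pushes the weights toward uniform.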

Paper Structure (14 sections, 3 equations, 7 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Feature-weighted Maximum Representative Subsampling
Experimental Results
Experimental Setup
Temperature Comparison
Downstream Task
Real-world Dataset
Discussion
Dataset Characteristics
Bias-Variance Decomposition
Distribution Alignment
AUROC per Iteration for GBS
Feature Importance GBS

Figures (7)

Figure 1: Schema for feature-weighted maximum representative subsampling (FW-MRS). Two surveys stem from the same population: one is biased and includes the variable under investigation, while the other is representative but does not include it. FW-MRS mitigates bias by comparing the distributions of the biased and representative datasets using a classifier that leverages auxiliary information from the representative dataset to remove samples from the biased dataset. The algorithm returns a representative subset and feature weights that align the non-representative study to the distribution of the representative one.
Figure 2: Validation AUROC vs. number of dropped samples: All hyperparameters were fixed except for the temperature, which was varied. Each point denotes the mean AUROC across runs, with ellipses indicating the standard deviation. Circles correspond to FW-MRS$_{RF}$ and the triangle represents MRS.
Figure 3: Relative dropped samples and AUROC for 50 iterations with 10 times repeated 5-fold cross-validation for FW-MRS$_{RF}$. Subfigure a) shows the relative dropped samples, and b) the corresponding AUROC. Hyperparameters were optimized on $N$ for the downstream task, and the effect of different temperature settings is compared.
Figure 4: Relative dropped samples over 50 iterations with 10 times repeated 5-fold cross-validation.
Figure 5: Feature weights used in FW-MRS$_{RF}$ debiasing of GBS with auxiliary information of Allensbach.
...and 2 more figures
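
Figures 2 and 3 report the domain classifier's AUROC as samples are dropped. An AUROC near 0.5 means the classifier can no longer tell the subsample from the representative data. The sketch below illustrates that relationship on toy data. It is not the authors' implementation: the scores are fixed, hypothetical stand-ins for domain-classifier outputs, whereas an MRS-style procedure would presumably re-estimate them between removal steps. The names `auroc` and `drop_highest_scores` are illustrative.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def drop_highest_scores(scores, n_drop):
    """Return the scores that remain after removing the n_drop largest."""
    return sorted(scores)[: len(scores) - n_drop]


# Hypothetical domain-classifier scores ("probability of being non-representative").
non_rep = [0.10, 0.20, 0.30, 0.80, 0.90, 0.95]  # biased sample
rep = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35]      # representative sample

before = auroc(non_rep, rep)
kept = drop_highest_scores(non_rep, n_drop=3)
after = auroc(kept, rep)
print(f"AUROC before: {before:.3f}, after dropping 3 samples: {after:.3f}")
```

Dropping the samples that the classifier separates most easily moves the AUROC toward 0.5, at the cost of a smaller subsample. That is the trade-off the AUROC-versus-dropped-samples plots visualize.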
