Towards Fairness and Privacy: A Novel Data Pre-processing Optimization Framework for Non-binary Protected Attributes

Manh Khoi Duong; Stefan Conrad

TL;DR

This paper presents a framework that addresses fairness by debiasing datasets containing a (non-)binary protected attribute, and demonstrates that, under this framework, genetic algorithms can effectively yield datasets that are fairer than the original data.

Abstract

Unfair outcomes of AI are often rooted in biased datasets. Therefore, this work presents a framework for addressing fairness by debiasing datasets containing a (non-)binary protected attribute. The framework formulates a combinatorial optimization problem in which heuristics such as genetic algorithms can be used to optimize the stated fairness objectives: the goal is to find a data subset that minimizes a chosen discrimination measure. Depending on a user-defined setting, the framework enables different use cases, such as data removal, the addition of synthetic data, or exclusive use of synthetic data. The exclusive use of synthetic data in particular enhances the framework's ability to preserve privacy while optimizing for fairness. In a comprehensive evaluation, we demonstrate that under our framework, genetic algorithms can effectively yield fairer datasets compared to the original data. In contrast to prior work, the framework exhibits a high degree of flexibility as it is metric- and task-agnostic, can be applied to both binary and non-binary protected attributes, and demonstrates efficient runtime.
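
The subset-selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the binary-mask encoding, truncation selection, one-point crossover, bit-flip mutation, the rule that a subset must keep every group, and all names (`discrimination`, `genetic_subset`, `pop_size`, `mutation_rate`) are assumptions made for illustration only. The discrimination measure used here is one plausible reading of a sum of absolute pairwise gaps in positive rates between groups.

```python
import random
from itertools import combinations


def discrimination(labels, groups, mask):
    """Sum of absolute pairwise gaps in positive rate over the selected rows."""
    counts, positives = {}, {}
    for y, g, keep in zip(labels, groups, mask):
        if keep:
            counts[g] = counts.get(g, 0) + 1
            positives[g] = positives.get(g, 0) + y
    # Assumption: a subset that drops an entire group is treated as invalid.
    if set(counts) != set(groups):
        return float("inf")
    rates = [positives[g] / counts[g] for g in counts]
    return sum(abs(a - b) for a, b in combinations(rates, 2))


def genetic_subset(labels, groups, pop_size=20, generations=50,
                   mutation_rate=0.05, seed=0):
    """Search for a binary mask (1 = keep row) with low discrimination."""
    rng = random.Random(seed)
    n = len(labels)

    def fitness(mask):
        return discrimination(labels, groups, mask)

    # The full dataset is part of the initial population, so the result
    # can never be worse than the original data.
    population = [[1] * n]
    population += [[rng.randint(0, 1) for _ in range(n)]
                   for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]         # bit-flip mutation
            children.append(child)
        population = parents + children

    return min(population, key=fitness)
```

Because the best half of each generation is carried over unchanged, the best score never gets worse from one generation to the next. Swapping in a different discrimination measure only requires replacing the `discrimination` function, which mirrors the metric-agnostic design described above.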

Paper Structure (20 sections, 9 equations, 3 figures, 4 tables)

Introduction
Related Work
Measuring Discrimination
Absolute Measures
Optimization Framework
Problem Formulation
Removing Samples ($S = \mathcal{D}$)
Employing Only Synthetic Data ($S = G$)
Merging Real and Synthetic Data ($S = \mathcal{D} \cup G$)
Adding Synthetic Data
Heuristics
Evaluation
Hyperparameter Tuning
Discrimination
Runtime
...and 5 more sections

Figures (3)

Figure 1: The pipeline consists of three steps: (1) The user sets the sample set $S$ and other settings, including the objective, discrimination measure, and protected attribute; (2) Synthetic data is generated if needed; (3) A solver optimizes the fairness objective to yield a bias-reduced subset $\mathcal{D}_\text{fair}$ from the user-selected set $S$. If $S = G$ was chosen, the user obtains a bias-reduced synthetic dataset that does not leak privacy-related information.
Figure 2: Heatmaps showing discrimination scores ($\psi_\text{SDP-sum}$) after pre-processing with genetic algorithms using different population sizes (y-axis) and generations (x-axis). Rows depict the results of Adult, Bank, and COMPAS datasets, while columns represent the objectives.
Figure 3: Heatmaps showing runtimes in seconds for the Bank dataset after pre-processing with genetic algorithms using different population sizes (y-axis) and generations (x-axis).

Theorems & Definitions (3)

Definition 1: Statistical parity
Definition 2: Sum of absolute statistical disparities
Definition 3: Maximal absolute statistical disparity
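
Only the names of these definitions appear here, so the following is a hedged sketch of how such measures are commonly computed, not the paper's exact formulas. It assumes a binary label `y`, a protected attribute `z` with any number of groups, and a pairwise disparity of the form $|P(y=1 \mid z=i) - P(y=1 \mid z=j)|$; the function names `positive_rates`, `sdp_sum`, and `sdp_max` are hypothetical.

```python
from itertools import combinations


def positive_rates(labels, groups):
    """Return the share of positive labels, P(y = 1 | z = g), per group g."""
    counts, positives = {}, {}
    for y, g in zip(labels, groups):
        counts[g] = counts.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + (1 if y == 1 else 0)
    return {g: positives[g] / counts[g] for g in counts}


def pairwise_disparities(labels, groups):
    """Absolute gap in positive rate for every unordered pair of groups."""
    rates = positive_rates(labels, groups)
    return [abs(rates[a] - rates[b]) for a, b in combinations(sorted(rates), 2)]


def sdp_sum(labels, groups):
    """Assumed form of the sum of absolute statistical disparities."""
    return sum(pairwise_disparities(labels, groups))


def sdp_max(labels, groups):
    """Assumed form of the maximal absolute statistical disparity."""
    return max(pairwise_disparities(labels, groups), default=0.0)
```

Under this reading, statistical parity holds exactly when every pairwise gap is zero, so both `sdp_sum` and `sdp_max` return 0 for a perfectly balanced dataset. With a binary protected attribute there is only one pair, and the two measures coincide.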
