Towards Poisoning Fair Representations

Tianci Liu, Haoyu Wang, Feijie Wu, Hengtong Zhang, Pan Li, Lu Su, Jing Gao

TL;DR

This work proposes the first data poisoning framework attacking FRL, and induces the model to output unfair representations that contain as much demographic information as possible by injecting carefully crafted poisoning samples into the training data.

Abstract

Fair machine learning seeks to mitigate model prediction bias against certain demographic subgroups, such as the elderly and women. Recently, fair representation learning (FRL) with deep neural networks has demonstrated superior performance: representations containing no demographic information are inferred from the data and then used as input to classification or other downstream tasks. Despite the development of FRL methods, their vulnerability to data poisoning attacks, a popular protocol for benchmarking model robustness in adversarial scenarios, is under-explored. Data poisoning attacks have been developed for classical fair machine learning methods, which incorporate fairness constraints into shallow-model classifiers. Nonetheless, these attacks fall short against FRL because of its notably different fairness goals and model architectures. This work proposes the first data poisoning framework attacking FRL. We induce the model to output unfair representations that contain as much demographic information as possible by injecting carefully crafted poisoning samples into the training data. This attack entails a prohibitive bilevel optimization, for which an effective approximate solution is proposed. A theoretical analysis of the number of poisoning samples needed is derived and sheds light on defending against the attack. Experiments on benchmark fairness datasets and state-of-the-art fair representation learning models demonstrate the superiority of our attack.
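
For orientation, the bilevel structure mentioned in the abstract can be written in the generic form commonly used for data poisoning. This is only a sketch under assumed notation: $U$ is the upper-level (attacker) loss named in Theorem 3.4, while $L$, $\mathcal{D}_c$ (clean training data), and $\mathcal{D}_p$ (poisoning samples) are placeholder symbols introduced here and may differ from the paper's own.

$$
\min_{\mathcal{D}_p} \; U(\theta^{*}) \quad \text{s.t.} \quad \theta^{*} \in \arg\min_{\theta} \; L(\theta;\, \mathcal{D}_c \cup \mathcal{D}_p)
$$

The lower level is the victim's ordinary training on the poisoned data; the upper level scores how much demographic information the resulting representations leak. Each candidate $\mathcal{D}_p$ nominally requires retraining the victim, which is why an exact solution is prohibitive and an approximation is needed.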

Paper Structure (27 sections, 2 theorems, 21 equations, 18 figures, 1 table, 1 algorithm)

Key Result

Theorem 3.4

Suppose that Assumptions asm:smooth, asm:well-train, and asm:match hold. Let $P$ and $N$ be the number of poisoning and total training samples, respectively. Set the learning rate to $\alpha$ and the batch size to $n$. Then, the ratio of poisoning data $P/N$ should satisfy such that the upper-level loss $U(\theta)$ is asymptotic to an optimal model. Here $c$ is a small constant (e.g., $10^{-4}$) f

Figures (18)

  • Figure 1: Framework of data poisoning attack (in red) on FRL. Irrelevant components such as class labels are omitted. The attacker poisons training data to contaminate the victim training (solid lines), resulting in unfair representations $\boldsymbol{z}$ for target data (dotted lines) such that its MI to sensitive feature $a$ is maximized. The MI is supposed to be small before the attack.
  • Figure 2: ENG-based attacks reduce BCE loss more than AA baselines while using a smaller portion of poisoning samples. Results are averaged over 5 independent replications and bands show standard errors.
  • Figure 3: Decrease of BCE loss and $L_1$ norm of perturbations learned by the ENG-FLD attack. Victims are trained on the Adult dataset and results are averaged over 5 replications.
  • Figure 4: Changes of FLD, sFLD, and EUC loss (the negative score) and corresponding BCE loss. Results are averaged over 5 independent replications and bands show standard errors.
  • Figure 5: Increase of DP violations from different attackers using 5%-15% of training samples for poisoning. Results are averaged over 5 independent replications and bands show standard errors.
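
The "DP violations" reported in Figure 5 can be illustrated with a minimal sketch. It assumes DP stands for demographic parity and that a violation is measured as the absolute gap in positive-prediction rates between two groups of a binary sensitive feature $a$; the function name `dp_violation` is hypothetical and the paper's exact metric may differ.

```python
def dp_violation(y_pred, a):
    """Absolute gap in positive-prediction rate between groups a == 1 and a == 0.

    y_pred: iterable of binary predictions (0 or 1).
    a:      iterable of binary sensitive-feature values (0 or 1), same length.
    Assumed definition for illustration; not taken from the paper.
    """
    group1 = [p for p, s in zip(y_pred, a) if s == 1]
    group0 = [p for p, s in zip(y_pred, a) if s == 0]
    if not group1 or not group0:
        raise ValueError("both sensitive groups must be non-empty")
    rate1 = sum(group1) / len(group1)  # P(y_hat = 1 | a = 1)
    rate0 = sum(group0) / len(group0)  # P(y_hat = 1 | a = 0)
    return abs(rate1 - rate0)


# Predictions that copy the sensitive feature give the largest possible gap.
worst_case = dp_violation([1, 1, 0, 0], [1, 1, 0, 0])
# Predictions with equal positive rates in both groups give no gap.
parity = dp_violation([1, 0, 1, 0], [1, 1, 0, 0])
```

Under this definition, a successful poisoning attack would show up as a larger gap for a classifier trained on the victim's representations.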

Theorems & Definitions (4)

  • Remark 2.1
  • Theorem 3.4
  • Theorem C.1
  • proof