Counterpart Fairness -- Addressing Systematic between-group Differences in Fairness Evaluation

Yifei Wang; Zhengyang Zhou; Liqin Wang; John Laurentiev; Peter Hou; Li Zhou; Pengyu Hong

TL;DR

We introduce Counterpart Fairness (CFair), a counterpart-based statistical fairness index for assessing the fairness of machine learning models. Evaluations with CFair indicate that standard group-based fairness metrics may not adequately convey the degree of unfairness present in predictions.

Abstract

When using machine learning to aid decision-making, it is critical to ensure that an algorithmic decision is fair and does not discriminate against specific individuals/groups, particularly those from underprivileged populations. Existing group fairness methods aim to ensure equal outcomes (such as loan approval rates) across groups delineated by protected variables like race or gender. However, in cases where systematic differences between groups play a significant role in outcomes, these methods may overlook the influence of non-protected variables that can systematically vary across groups. These confounding factors can affect fairness evaluations, making it challenging to assess whether disparities are due to discrimination or inherent differences. Therefore, we recommend a more refined and comprehensive fairness index that accounts for both the systematic differences within groups and the multifaceted, intertwined confounding effects. The proposed index evaluates fairness on counterparts (pairs of individuals who are similar with respect to the task of interest but from different groups), whose group identities cannot be distinguished algorithmically by exploring confounding factors. To identify counterparts, we developed a two-step matching method inspired by propensity score and metric learning. In addition, we introduced a counterpart-based statistical fairness index, called Counterpart Fairness (CFair), to assess the fairness of machine learning models. Empirical results on the MIMIC and COMPAS datasets indicate that standard group-based fairness metrics may not adequately inform about the degree of unfairness present in predictions, as revealed through CFair.
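The two-step matching idea described in the abstract (a propensity-score step followed by a similarity-based refinement) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `match_counterparts`, the `caliper` and `max_dist` thresholds, the greedy pairing rule, and the use of plain Euclidean distance in place of a learned metric are all assumptions made for illustration. Propensity scores are assumed to be precomputed by some classifier that predicts group membership.

```python
from math import dist


def match_counterparts(g0, g1, caliper=0.05, max_dist=1.0):
    """Greedy 1-1 matching sketch (hypothetical, not the paper's algorithm).

    g0, g1: lists of (propensity_score, feature_vector) tuples, one per person.
    Step 1: keep cross-group pairs whose propensity scores differ by <= caliper.
    Step 2: among those, pair individuals greedily by smallest feature distance,
            discarding pairs farther apart than max_dist.
    Returns a list of (index_in_g0, index_in_g1) pairs.
    """
    candidates = []
    for i, (p0, x0) in enumerate(g0):
        for j, (p1, x1) in enumerate(g1):
            if abs(p0 - p1) <= caliper:      # step 1: propensity caliper
                d = dist(x0, x1)             # step 2: feature similarity
                if d <= max_dist:
                    candidates.append((d, i, j))

    candidates.sort()                        # closest pairs first
    used0, used1, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used0 and j not in used1:  # enforce 1-1 matching
            pairs.append((i, j))
            used0.add(i)
            used1.add(j)
    return pairs


# Toy data: (propensity score, baseline features)
g0 = [(0.50, (1.0, 2.0)), (0.20, (0.0, 0.0)), (0.52, (5.0, 5.0))]
g1 = [(0.51, (1.1, 2.0)), (0.90, (0.0, 0.0)), (0.53, (1.0, 2.5))]
print(match_counterparts(g0, g1))
```

In this toy example only the first individuals of each group end up paired: the others either fall outside the propensity caliper or are too dissimilar in their baseline features. Tightening `caliper` or `max_dist` yields fewer but more similar counterparts.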

Paper Structure (47 sections, 1 theorem, 12 equations, 7 figures, 14 tables)

Introduction
Preliminaries
Revisiting Demographic Parity -- A Popular Group Fairness Index
DP Gap Distorted by Systematic Differences
Remark 1.
Remark 2.
Method
Counterpart Fairness
Counterparts
CFair: Fairness on Counterparts Between Groups
An Implementation of CFair
Propensity Score Matching
Identifying 1-1 Counterparts
Experiments
Significant Systematic Differences Revealed by Propensity Score
...and 32 more sections

Key Result

Corollary 3.3

Given two groups $G_0$ and $G_1$, both $C_{0, \delta}$ and $C_{1, \delta}$ are unique.

Figures (7)

Figure 1: DP gap and biases. (A) $\Delta_{\text{DP}}$ = 0 if two sample groups follow the same underlying distributions. (B) When distributions of two groups are substantially different, the true $\Delta_{\text{DP}}$ should significantly deviate from 0. (C) Biased sampling could distort DP gap estimation. In this example, the distributions (curves) of two groups are the same, and their true $\Delta_{\text{DP}}$ should be 0. However, the difference in their sample distributions (bars) leads to a large estimated $\widehat{\Delta_{\text{DP}}}$.
Figure 2: Identify 1-1 counterparts. (A) Potential confounding factors (PCF) are a subset of non-protected variables used by an ML model for predicting outcomes, and are strongly associated with the protected variable, which can be explored by ML to accurately predict the protected variable. In this way, the protected variable can "dictate" the outcomes of the ML model (i.e., the model is biased) even though it is not used in training, which can mislead fairness evaluation. (B) Propensity score matching (PSM) is used to identify initial matches between individuals in groups $G_0$ and $G_1$, among which the association between the PCF and the protected variable is weak. (C) The initial matches are then refined by considering the between-individual similarities in their baseline characteristics. This step produces the 1-1 counterparts between the subgroups identified by PSM.
Figure 3: Comparing the propensity score distributions. (A) Black vs White in the case of sepsis patients in the MIMIC dataset, (B) Black vs White in the COMPAS dataset, (C) Male vs Female in the German Banking dataset, and (D) Black vs White in the Adult dataset. Systematic differences are observed in all datasets. In particular, the propensity score distributions of the two groups in both (C) and (D) do not overlap at all, and their concentrations are well separated.
Figure F.1: Systematic differences are observed between males and females in the German Banking dataset (germancredit1994). The target is to predict the risk of a client ("good" vs "bad"). (A) Males and females have different loan purposes. Hence, it is reasonable for banks to treat their loan applications differently, for example, by enforcing different levels of scrutiny and requiring different documentation. (B) The age distributions of male and female applicants differ with statistical significance (Kolmogorov-Smirnov test (massey1951kolmogorov), $p$-value < 0.05) in several loan categories: radio/TV, car, and domestic appliance. Since age is an important factor in lenders' loan decisions, it is expected that banks would treat applications from males and females differently even if they have the same loan purpose. (C) The credit amount distributions also differ significantly between males and females in several borrowing purpose categories. In general, females had a lower average credit than males across many loan purpose categories.
Figure F.2: Changes in the level of systematic differences with respect to the stringency of counterpart selection in the MIMIC experiment. Loosening the counterpart similarity constraint leads to larger counterpart groups but increases systematic differences, as indicated by the growing number of features whose means differ significantly between groups ($t$-test, significance level 0.05).
...and 2 more figures
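
The DP gap discussed in Figure 1 can be illustrated with a small sketch. It uses the standard sample estimate of the demographic parity gap (the absolute difference in positive-prediction rates between two groups) and, for contrast, the same quantity restricted to matched pairs. The helper names `dp_gap` and `dp_gap_on_pairs`, the toy predictions, and the pairing are hypothetical; the paper's CFair index is defined in the full text and may differ from this simplified restriction.

```python
def positive_rate(preds):
    """Fraction of positive (1) predictions in a list of 0/1 values."""
    return sum(preds) / len(preds)


def dp_gap(preds, groups):
    """Sample estimate of |P(yhat=1 | group 0) - P(yhat=1 | group 1)|."""
    r0 = positive_rate([p for p, g in zip(preds, groups) if g == 0])
    r1 = positive_rate([p for p, g in zip(preds, groups) if g == 1])
    return abs(r0 - r1)


def dp_gap_on_pairs(preds0, preds1, pairs):
    """Same gap, computed only over matched (i, j) pairs across groups.

    pairs: list of (index_in_group0, index_in_group1), assumed to come
    from some counterpart-matching step (hypothetical here).
    """
    r0 = positive_rate([preds0[i] for i, _ in pairs])
    r1 = positive_rate([preds1[j] for _, j in pairs])
    return abs(r0 - r1)


# Toy predictions for two groups of four individuals each
preds0 = [1, 1, 1, 0]
preds1 = [1, 0, 0, 0]
groups = [0] * 4 + [1] * 4

print(dp_gap(preds0 + preds1, groups))                       # gap over everyone
print(dp_gap_on_pairs(preds0, preds1, [(0, 0), (3, 1)]))     # gap over matched pairs
```

With these toy numbers the group-level gap is 0.5 while the gap over the two matched pairs is 0.0, which shows how a group-level estimate and a counterpart-level estimate can tell different stories when the groups differ in composition.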

Theorems & Definitions (7)

Definition 2.1
Definition 2.2
Definition 3.1: $\delta$-element and $\delta$-counterpart
Definition 3.2: $\delta$-group
Corollary 3.3
Definition 3.4: 1-1 $\delta$-counterpart groups
proof
