Fair Model-based Clustering

Jinwon Park; Kunwoong Kim; Jihu Lee; Yongdai Kim

TL;DR

We propose Fair Model-based Clustering (FMC), a new fair clustering algorithm based on a finite mixture model. Its main advantage is that the number of learnable parameters is independent of the sample size, so the algorithm scales up easily.
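
To make the scaling claim concrete, the following is a minimal sketch that counts parameters under two illustrative setups. It assumes a Gaussian mixture with diagonal covariances as the finite mixture model, and an assignment-based method that learns one cluster label per data point; both choices, and the function names `mixture_param_count` and `assignment_param_count`, are hypothetical and not taken from the paper.

```python
def mixture_param_count(k: int, d: int) -> int:
    """Free parameters of a K-component Gaussian mixture with diagonal
    covariances (an assumed model family, chosen only for illustration)."""
    weights = k - 1      # mixing weights sum to one, so K - 1 are free
    means = k * d        # one d-dimensional mean per component
    variances = k * d    # one d-dimensional variance vector per component
    return weights + means + variances


def assignment_param_count(n: int, k: int, d: int) -> int:
    """Parameters when every data point carries its own assignment variable
    in addition to the K cluster centers (illustrative count)."""
    centers = k * d
    assignments = n      # one learned cluster label per data point
    return centers + assignments


if __name__ == "__main__":
    k, d = 10, 5
    for n in (1_000, 1_000_000):
        # The mixture count does not depend on n; the assignment count does.
        print(n, mixture_param_count(k, d), assignment_param_count(n, k, d))
```

Under these assumptions the mixture count stays fixed as `n` grows, while the assignment-based count grows linearly with `n`.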

Abstract

The goal of fair clustering is to find clusters such that the proportion of sensitive attributes (e.g., gender, race, etc.) in each cluster is similar to that of the entire dataset. Various fair clustering algorithms have been proposed that modify standard K-means clustering to satisfy a given fairness constraint. A critical limitation of several existing fair clustering algorithms is that the number of parameters to be learned is proportional to the sample size because the cluster assignment of each datum should be optimized simultaneously with the cluster center, and thus scaling up the algorithms is difficult. In this paper, we propose a new fair clustering algorithm based on a finite mixture model, called Fair Model-based Clustering (FMC). A main advantage of FMC is that the number of learnable parameters is independent of the sample size and thus can be scaled up easily. In particular, mini-batch learning is possible to obtain clusters that are approximately fair. Moreover, FMC can be applied to non-metric data (e.g., categorical data) as long as the likelihood is well-defined. Theoretical and empirical justifications for the superiority of the proposed algorithm are provided.
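
The fairness goal stated at the start of the abstract can be checked directly on a given clustering. The sketch below compares the sensitive-attribute proportions inside each cluster with those of the whole dataset. The gap it reports is a simple illustrative measure, not the paper's $\Delta$ or Balance metric, and the function names `group_proportions` and `cluster_proportion_gaps` are hypothetical.

```python
from collections import Counter


def group_proportions(values):
    """Return the share of each sensitive group in a list of group labels."""
    counts = Counter(values)
    total = len(values)
    return {group: count / total for group, count in counts.items()}


def cluster_proportion_gaps(labels, sensitive):
    """For each cluster, return the largest absolute difference between a
    group's share inside the cluster and its share in the whole dataset.
    A gap of 0 means the cluster mirrors the dataset exactly."""
    overall = group_proportions(sensitive)
    gaps = {}
    for cluster in sorted(set(labels)):
        members = [s for l, s in zip(labels, sensitive) if l == cluster]
        within = group_proportions(members)
        gaps[cluster] = max(
            abs(within.get(group, 0.0) - share)
            for group, share in overall.items()
        )
    return gaps


if __name__ == "__main__":
    # Toy data: 8 points, 2 clusters, binary sensitive attribute.
    labels = [0, 0, 0, 0, 1, 1, 1, 1]
    sensitive = [0, 0, 1, 1, 0, 0, 0, 1]
    print(cluster_proportion_gaps(labels, sensitive))
```

A fair clustering in the sense of the abstract would drive these per-cluster gaps toward zero, typically at some cost in clustering quality.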

Paper Structure (60 sections, 5 theorems, 69 equations, 11 figures, 8 tables, 3 algorithms)

Introduction
Review of Existing Fair Clustering Algorithms
Finite Mixture Models
Assignment Map for the Finite Mixture Model
Learning a Fair Finite Mixture Model
Fairness Constraint for the Finite Mixture Model
Learning Algorithms: FMC-GD and FMC-EM
(1) FMC-GD
(2) FMC-EM
Mini-Batch Learning with Sub-Sampled $\Delta$
Sub-Sample Learning
Numerical Experiments
Experimental Settings
Datasets
Algorithms
...and 45 more sections

Key Result

Proposition 1

Let $n_s=|\{x_i\in \mathcal{D}_n: s_i=s\}|$ for $s\in \{1, 2\}.$ Then, with probability at least $1-\delta,$ we have for some constant $C=C(\xi,\zeta,\nu,x_{\max}),$ where $x_{\max}=\max_{x\in \mathcal{D}} \|x\|_2$ and $n'=\min\{n_1, n_2\}.$

Figures (11)

Figure 1: Pareto front lines between $\Delta$ and Cost on the Adult, Bank, and Credit datasets. See \ref{fig:three_L2_pareto_Balance} for the lines between Balance and Cost, and \ref{fig:three_noL2_pareto} for similar results without $L_2$ normalization.
Figure 2: Pareto front lines between $\Delta$ and Cost (with standard deviation bands obtained from five random initializations) of FMC-EM (left) and FMC-GD (right) on the Adult dataset. See \ref{fig:10_seeds_remaining} in the Appendix for similar results with respect to Balance and on other datasets.
Figure 3: Pareto front lines between $\{\Delta, \textup{Balance}\}$ and Cost on the Census dataset. See \ref{fig:census_noL2_pareto} in the Appendix for similar results without $L_2$ normalization.
Figure 4: First row: Pareto front lines between Balance and Cost with standard deviation bands of FMC-EM and FMC-GD on the Adult dataset. Second row: Pareto front lines between $\{\Delta, \textup{Balance}\}$ and Cost with standard deviation bands of FMC-EM and FMC-GD on the Bank dataset. Third row: Pareto front lines between $\{\Delta, \textup{Balance}\}$ and Cost with standard deviation bands of FMC-EM and FMC-GD on the Credit dataset.
Figure 5: Pareto front lines between Balance and Cost on Adult, Bank, and Credit datasets.
...and 6 more figures

Theorems & Definitions (10)

Proposition 1
Definition 2: Rademacher complexity of a function class \cite{mohri2018foundations}
Lemma 3: Rademacher generalization bound \cite{shalev2014understanding}
Lemma 4: Dudley's theorem \cite{wolf2023mathematical}
Lemma 5: Upper bound of Rademacher complexity
proof
proof: Proof of \ref{prop:generalization}
Remark 6
Corollary 7
proof
