A Computational Approach to Improving Fairness in K-means Clustering

Guancheng Zhou, Haiping Xu, Hongkang Xu, Chenyu Li, Donghui Yan

TL;DR

This work addresses fairness in K-means clustering, where clusters can disproportionately contain data from certain subpopulations. It proposes a two-stage approach: first perform standard clustering to obtain high-quality partitions, then adjust the membership of a small set of near-boundary points to improve fairness, thereby reducing bias with minimal loss in clustering quality. Two scalable heuristics are introduced—near-foreign, which targets far-from-centroid points near another cluster, and a Gini-index-based method that identifies highly mixed boundary points—along with formal definitions of fairness via $\mathcal{F}$ and cluster balance $\beta(A)$. Empirical results on seven UCI datasets show meaningful fairness gains with only slight perturbations to the clustering quality metric $\kappa$, demonstrating the methods’ practicality and broad applicability to other clustering algorithms and fairness notions.

Abstract

The popular K-means clustering algorithm potentially suffers from a major weakness for further analysis or interpretation: some clusters may have disproportionately more (or fewer) points from one of the subpopulations defined by a sensitive variable, e.g., gender or race. Such a fairness issue may cause bias and unexpected social consequences. This work attempts to improve the fairness of K-means clustering with a two-stage optimization formulation: clustering first, and then adjusting the cluster membership of a small subset of selected data points. Two computationally efficient algorithms are proposed for identifying the data points that are expensive for fairness, one focusing on the nearest data points outside of a cluster and the other on highly 'mixed' data points. Experiments on benchmark datasets show substantial improvement in fairness with minimal impact on clustering quality. The proposed algorithms can be easily extended to a broad class of clustering algorithms or fairness metrics.
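To make the fairness issue concrete, the following minimal sketch compares each cluster's share of every sensitive group with that group's overall share. It is only an illustrative diagnostic: the function names and the "largest gap" summary are assumptions made here, not the paper's fairness measure $\mathcal{F}$ or cluster balance $\beta(A)$, whose definitions are not reproduced in this section.

```python
from collections import Counter, defaultdict


def cluster_composition(labels, sensitive):
    """Map each cluster to the fraction of its points in each sensitive group."""
    counts = defaultdict(Counter)
    for cluster, group in zip(labels, sensitive):
        counts[cluster][group] += 1
    return {
        cluster: {g: n / sum(cnt.values()) for g, n in cnt.items()}
        for cluster, cnt in counts.items()
    }


def overall_composition(sensitive):
    """Fraction of all points that belong to each sensitive group."""
    total = len(sensitive)
    return {g: n / total for g, n in Counter(sensitive).items()}


def max_deviation(labels, sensitive):
    """Largest gap between a cluster's group share and the overall share.

    Hypothetical summary statistic, used here for illustration only.
    """
    overall = overall_composition(sensitive)
    composition = cluster_composition(labels, sensitive)
    return max(
        abs(shares.get(group, 0.0) - p)
        for shares in composition.values()
        for group, p in overall.items()
    )


# Toy data: cluster 0 is dominated by "F", cluster 1 by "M".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
sensitive = ["F", "F", "F", "M", "M", "M", "M", "F"]

print(cluster_composition(labels, sensitive))
print(max_deviation(labels, sensitive))
```

In this toy example each group makes up half of the data but three quarters of one cluster, which is the kind of imbalance the second stage is meant to reduce by reassigning a few points.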

Paper Structure

This paper contains 8 sections, 6 equations, 5 figures, 3 tables, 3 algorithms.

Figures (5)

  • Figure 1: Illustration of the fairness issue in clustering. Points of different colors indicate different traits on a sensitive variable, e.g., gender, where blue indicates male and red female. Cluster 1 is dominated by females while Cluster 2 is dominated by males. An arrow on a point indicates that we might switch its cluster membership assignment to make the clusters less dominated by one subpopulation.
  • Figure 2: Illustration of connectivity of data points in the same cluster. If one switches point $a$ to the red cluster, then green points between it and the red cluster (i.e., points enclosed by the dashed curve) should also be switched.
  • Figure 3: Illustration of the near-foreign heuristic. Points from different classes are indicated by different colors. $\star$ indicates the centroid of the cluster formed by the red points or by the green points.
  • Figure 4: Illustration of boundary and non-boundary points. Points from different classes are indicated by different colors. The circles mark the 10-nearest neighborhoods of the given points $a$, $b$ and $c$. The respective Gini indices are calculated as $0.66$, $0.50$, and $0$ (a value of $0$ indicates that the given point is an interior point).
  • Figure 5: Insensitivity of the cluster quality over different choices of neighborhood size $k \in \{5, 10, 15\}$ in calculating the Gini index of individual data points. The cluster qualities vary very little when $k$ increases from $5$ to $15$.
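The Gini values quoted for Figure 4 are consistent with the usual Gini impurity $1 - \sum_i p_i^2$ computed over the labels in a point's $k$-nearest neighborhood. The sketch below works under that assumption; the function names, the use of Euclidean distance, and the choice to exclude the query point from its own neighborhood are illustrative assumptions rather than details confirmed by the paper.

```python
import math
from collections import Counter


def gini_index(labels):
    """Gini impurity 1 - sum_i p_i^2 of a collection of labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def knn_gini(points, labels, query_index, k=10):
    """Gini index of the labels among the k nearest neighbors of one point.

    Assumes Euclidean distance and leaves the query point itself out.
    """
    query = points[query_index]
    others = [i for i in range(len(points)) if i != query_index]
    others.sort(key=lambda i: math.dist(points[i], query))
    return gini_index([labels[i] for i in others[:k]])


# Neighborhoods of size 10 that reproduce the three values in the caption.
print(gini_index([0] * 3 + [1] * 3 + [2] * 4))  # approximately 0.66
print(gini_index([0] * 5 + [1] * 5))            # 0.5
print(gini_index([0] * 10))                     # 0.0 (interior point)

# Toy 2-D data: two well-separated groups of three points each.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_gini(points, labels, query_index=2, k=2))  # both neighbors share a label
print(knn_gini(points, labels, query_index=2, k=3))  # one neighbor from the other group
```

A 3/3/4 split across three labels is only one neighborhood that yields $0.66$; other splits of ten neighbors, for instance 5/2/2/1, give the same value. Points with a high index have mixed neighborhoods and are candidates for boundary points, while an index of $0$ marks an interior point.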