A Scalable Algorithm for Individually Fair K-means Clustering

MohammadHossein Bateni; Vincent Cohen-Addad; Alessandro Epasto; Silvio Lattanzi

TL;DR

This work tackles scalable, individually fair clustering under per-point radius constraints by introducing ConstrainedLocalSearch++, a fast local-search algorithm with seeding and anchor-zone mechanisms. It achieves a bicriteria $(O(1), 6)$-approximation, that is, a 6-approximation for the radii together with a constant-factor approximation of the $k$-means cost, and runs in $\tilde{O}(nd + nk^2)$ time. The algorithm is both theoretically grounded and empirically validated, demonstrating substantial speedups and often lower costs than previous methods on large real-world datasets. These results make individually fair clustering practical at scale and open avenues for tighter theoretical bounds and generalizations to broader objective functions.

Abstract

We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al. Given $n$ points $P$ in a metric space, let $\delta(x)$ for $x\in P$ be the radius of the smallest ball around $x$ containing at least $n / k$ points. A clustering is then called individually fair if it has a center within distance $\delta(x)$ of $x$ for each $x\in P$. While good approximation algorithms are known for this problem, no efficient practical algorithms with good theoretical guarantees have been presented. We design the first fast local-search algorithm that runs in $\tilde{O}(nk^2)$ time and obtains a bicriteria $(O(1), 6)$ approximation. Then we show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
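
The fairness radius and the fairness condition defined above can be illustrated with a small brute-force sketch. This is not the paper's algorithm: it takes quadratic time and only checks a given set of centers. The function names `fairness_radii` and `is_individually_fair` and the `alpha` relaxation parameter are hypothetical, and the sketch assumes Euclidean distance and that the ball around $x$ counts $x$ itself.

```python
import math


def fairness_radii(points, k):
    """Return delta(x) for every point x.

    Assumption: the ball around x counts x itself, so delta(x) is the
    distance from x to its ceil(n / k)-th nearest point (x included).
    """
    n = len(points)
    m = math.ceil(n / k)  # number of points the ball must contain
    radii = []
    for x in points:
        # Sorted distances from x to all points; index 0 is x itself (0.0).
        dists = sorted(math.dist(x, y) for y in points)
        radii.append(dists[m - 1])
    return radii


def is_individually_fair(points, centers, k, alpha=1.0):
    """Check that every x has a center within alpha * delta(x).

    alpha = 1 is the exact fairness condition; alpha = 6 corresponds to
    the relaxed radius bound of a 6-approximation on the radii.
    """
    radii = fairness_radii(points, k)
    return all(
        min(math.dist(x, c) for c in centers) <= alpha * r
        for x, r in zip(points, radii)
    )
```

For example, calling `is_individually_fair(points, centers, k, alpha=6.0)` on the output of any clustering routine tests whether that output meets the relaxed radius constraint; it says nothing about the $k$-means cost.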

Paper Structure (26 sections, 11 theorems, 22 equations, 2 figures, 4 tables, 3 algorithms)

Introduction
Preliminaries
Fast algorithm
Analysis (Proof of \ref{th:ls++})
Proof of \ref{lem:main}
Empirical analysis
Datasets.
Algorithms.
Experiments on the full datasets.
Conclusions and Future Works
Additional experimental results
Small scale datasets and effect of F-Lloyd improvement
Effect of $k$
Additional results on large-scale datasets
Standard deviation of the metrics in large datasets.
...and 11 more sections

Key Result

Theorem 1.1

There is an $\tilde{O}(nd+ nk^2)$-time algorithm for individually fair $k$-means with a 6-approximation for radii and an $O(1)$-approximation on costs.
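
One way to read this guarantee, assuming the usual bicriteria convention, is sketched below. The symbols $S$ (the returned set of $k$ centers), $d$ (the metric), and $\mathrm{OPT}$ (the minimum $k$-means cost over sets of $k$ centers that satisfy the exact radius constraints) are notation introduced here for illustration and may differ from the paper's.

```latex
\[
  \forall x \in P:\quad \min_{c \in S} d(x, c) \;\le\; 6\,\delta(x),
  \qquad
  \sum_{x \in P} \min_{c \in S} d(x, c)^2 \;\le\; O(1) \cdot \mathrm{OPT},
\]
\[
  \mathrm{OPT} \;=\; \min_{\substack{S^* :\ |S^*| = k \\ \forall x \in P:\ \min_{c \in S^*} d(x, c) \le \delta(x)}}
  \ \sum_{x \in P} \min_{c \in S^*} d(x, c)^2 .
\]
```

Under this reading, the radius constraint is relaxed by a factor of 6, while the cost is compared against the best solution that satisfies the unrelaxed constraint.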

Figures (2)

Figure 1: Mean completion time, cost, and bound ratio for the algorithms on the Gowalla dataset subsampled to different sizes, $k=10$. The shaded areas represent the $95\%$ confidence interval (note that some algorithms are deterministic). Runs that did not complete within $1$ hour on the sample are not reported. The VanillaKMeans bound ratio is > 60 in all runs and is not shown in the plot because it is out of scale.
Figure 2: Mean completion cost and bound ratio for the algorithms on the adult dataset subsampled to 1000 elements and different $k$'s. The shades represent the $95\%$ confidence interval.

Theorems & Definitions (22)

Theorem 1.1
Lemma 3.1
Lemma 3.2
Lemma 3.3
Lemma 3.4
Proposition 3.5
Definition 3.6
Lemma 3.7
Definition 3.8
Lemma 3.9
...and 12 more
