Relax and Merge: A Simple Yet Effective Framework for Solving Fair $k$-Means and $k$-sparse Wasserstein Barycenter Problems

Shihong Song, Guanlin Mo, Qingyuan Yang, Hu Ding

TL;DR

The paper addresses fair clustering under $(\alpha,\beta)$-fair constraints for $k$-means and explores the related $k$-sparse Wasserstein Barycenter problem in Euclidean space. It introduces the Relax and Merge framework, which leverages an $\epsilon$-approximate centroid set to construct a relaxed candidate center set, solves a fair LP to obtain a fractional assignment, and then merges via a vanilla $k$-means step to obtain a final center set with strong approximation guarantees. The key contributions are: (i) a fractional $(1+4\rho+O(\epsilon))$-approximation for fair $k$-means and $k$-WB (with a $(5+O(\epsilon))$-approximation under a PTAS for vanilla $k$-means), (ii) a $(2+6\rho)$-approximation for strictly fair no-violation $k$-means, and (iii) comprehensive experiments showing substantial improvements over baselines. These results advance both the theoretical guarantees and practical performance for fair clustering and transport-based barycenter problems in low-dimensional spaces.

Abstract

The fairness of clustering algorithms has gained widespread attention across various areas, including machine learning. In this paper, we study fair $k$-means clustering in Euclidean space. Given a dataset comprising several groups, the fairness constraint requires that, in each cluster, the proportion of points from each group lies within specified lower and upper bounds. Because of these fairness constraints, determining the optimal locations of the $k$ centers is a challenging task. We propose a novel "Relax and Merge" framework that returns a $(1+4\rho+O(\epsilon))$-approximate solution, where $\rho$ is the approximation ratio of an off-the-shelf vanilla $k$-means algorithm and $O(\epsilon)$ can be an arbitrarily small positive number. If equipped with a PTAS for $k$-means, our solution achieves an approximation ratio of $(5+O(\epsilon))$ with only a slight violation of the fairness constraints, which improves the current state-of-the-art approximation guarantee. Furthermore, using our framework, we also obtain a $(1+4\rho+O(\epsilon))$-approximate solution for the $k$-sparse Wasserstein Barycenter problem, a fundamental optimization problem in the field of optimal transport, and a $(2+6\rho)$-approximate solution for strictly fair $k$-means clustering with no violation; both improve on the current state-of-the-art methods. In addition, the empirical results demonstrate that our proposed algorithm can significantly outperform baseline approaches in terms of clustering cost.
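The fairness constraint described in the abstract can be illustrated with a small check. The following is a minimal sketch, not the paper's algorithm: the function name `is_fair` and the per-group dictionaries `lower` and `upper` are hypothetical, and it only tests whether a given hard assignment keeps each group's share of every cluster within the stated bounds.

```python
from collections import Counter


def is_fair(cluster_labels, group_labels, lower, upper):
    """Return True if, in every cluster, the share of each group g
    lies in the interval [lower[g], upper[g]].

    cluster_labels[i] and group_labels[i] describe point i.
    `lower` and `upper` are hypothetical per-group bound dictionaries.
    """
    cluster_sizes = Counter(cluster_labels)
    pair_counts = Counter(zip(cluster_labels, group_labels))
    for cluster, size in cluster_sizes.items():
        for group in lower:
            # Fraction of this cluster's points that belong to `group`.
            share = pair_counts[(cluster, group)] / size
            if not (lower[group] <= share <= upper[group]):
                return False
    return True


# Toy data: two groups, two clusters of four points each.
lower = {"a": 0.4, "b": 0.4}
upper = {"a": 0.6, "b": 0.6}
clusters = [0, 0, 0, 0, 1, 1, 1, 1]

balanced_groups = ["a", "b", "a", "b", "a", "b", "a", "b"]
skewed_groups = ["a", "b", "a", "b", "a", "a", "a", "b"]

balanced_ok = is_fair(clusters, balanced_groups, lower, upper)
skewed_ok = is_fair(clusters, skewed_groups, lower, upper)
```

In the toy data, the balanced assignment gives each group half of every cluster, while the skewed one puts three of four points of cluster 1 in group `a`, which exceeds the upper bound of 0.6.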

Paper Structure

This paper contains 29 sections, 12 theorems, 21 equations, 11 figures, 9 tables, 2 algorithms.

Key Result

Proposition 1

Given a finite weighted point set $Q\subset \mathbb{R}^d$, for any point $a$, $\sum_{q\in Q}w(q)\Vert a - q \Vert^2 = \sum_{q\in Q}w(q) \Vert q-\mathtt{Cen}(Q)\Vert ^2 + w(Q)\cdot \Vert a-\mathtt{Cen}(Q)\Vert ^2$, where $w(Q)$ is the total weight of $Q$.
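The identity in Proposition 1 can be checked numerically on a small instance. The sketch below is illustrative only and assumes that $\mathtt{Cen}(Q)$ denotes the weighted mean of $Q$; the helper names (`centroid`, `sq_dist`, `weighted_cost`) and the sample points are not from the paper.

```python
def centroid(points, weights):
    """Weighted mean of the points (assumed meaning of Cen(Q))."""
    total = sum(weights)
    dim = len(points[0])
    return [
        sum(w * p[i] for p, w in zip(points, weights)) / total
        for i in range(dim)
    ]


def sq_dist(x, y):
    """Squared Euclidean distance between x and y."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def weighted_cost(points, weights, a):
    """Sum over q in Q of w(q) * ||a - q||^2."""
    return sum(w * sq_dist(a, p) for p, w in zip(points, weights))


# A small weighted point set Q in the plane and an arbitrary point a.
Q = [(0.0, 0.0), (2.0, 0.0), (0.0, 4.0)]
w = [1.0, 2.0, 1.0]
a = (3.0, -1.0)

cen = centroid(Q, w)

# Left-hand side: cost of serving Q from a.
lhs = weighted_cost(Q, w, a)
# Right-hand side: cost at the centroid plus w(Q) * ||a - Cen(Q)||^2.
rhs = weighted_cost(Q, w, cen) + sum(w) * sq_dist(a, cen)
```

Here the centroid is $(1, 1)$ and both sides evaluate to 48: the cost at the centroid is 16, and moving the center to $a$ adds $w(Q)\cdot\Vert a-\mathtt{Cen}(Q)\Vert^2 = 4 \cdot 8 = 32$.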

Figures (11)

  • Figure 1: The cost obtained by the algorithms with different $k$.
  • Figure 2: The cost of strictly fair $k$-means.
  • Figure 3: The difference between the location of $k$-means clustering centers and the fair $k$-means clustering centers. The input dataset contains 3 different groups represented by orange, blue, and green points respectively. The red diamonds represent the cluster centers under different assumptions for the clustering problem. (a) shows the clustering result of $k$-means, while (b) shows the clustering result of fair $k$-means.
  • Figure 4: The instance of the minimum cost circulation problem established through $(S, \phi_S^*)$. The upper and lower bounds of the flow for each arc are annotated in the graph.
  • Figure 5: Comparison on Clustering Cost with $\delta = 0.1$
  • ...and 6 more figures

Theorems & Definitions (22)

  • Proposition 1
  • Definition 1
  • Remark 1
  • Definition 2: Wasserstein Distance
  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • ...and 12 more