Improved Streaming Algorithm for Fair $k$-Center Clustering

Longkun Guo, Zeyu Lin, Chaoqi Jia, Chao Chen

TL;DR

The paper tackles fair $k$-center clustering in the streaming model with a two-stage framework: during the stream it buffers representative points using a $\lambda$-independent center set, and after the stream it selects the final centers from that reserved subset. The one-pass algorithm achieves a 5-approximation with $O(k\log n)$ memory. For semi-structured streams, where each group arrives in batches, it gives a 3-approximation for $m = 2$ groups and a 4-approximation for general $m$; the same approach also yields a 3-approximation for the offline problem. The post-streaming step analyzes three cases, and a polynomial-time procedure on an auxiliary bipartite graph turns Case (3) into a constrained vertex-cover problem while preserving the fairness constraints. Experiments on real and simulated datasets show lower clustering cost and runtime than the baselines, and the 5-approximation bound is shown to be tight under sublinear memory.

Abstract

Many real-world applications pose challenges in incorporating fairness constraints into the $k$-center clustering problem, where the dataset consists of $m$ demographic groups, each with a specified upper bound on the number of centers to ensure fairness. Focusing on big data scenarios, this paper addresses the problem in a streaming setting, where data points arrive one by one sequentially in a continuous stream. Leveraging a structure called the $\lambda$-independent center set, we propose a one-pass streaming algorithm that first computes a reserved set of points during the streaming process. Then, for the post-streaming process, we propose an approach for selecting centers from the reserved point set by analyzing all three possible cases, transforming the most complicated one into a specially constrained vertex cover problem in an auxiliary graph. Our algorithm achieves a tight approximation ratio of 5 while consuming $O(k\log n)$ memory. It can also be readily adapted to solve the offline fair $k$-center problem, achieving a 3-approximation ratio that matches the current state of the art. Furthermore, we extend our approach to a semi-structured data stream, where data points from each group arrive in batches. In this setting, we present a 3-approximation algorithm for $m = 2$ and a 4-approximation algorithm for general $m$. Lastly, we conduct extensive experiments to evaluate the performance of our approaches, demonstrating that they outperform existing baselines in both clustering cost and runtime efficiency.
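To make the problem statement concrete, the following minimal sketch evaluates a candidate solution: it checks the per-group upper bounds on centers and computes the $k$-center cost (the largest distance from any point to its nearest center). It is not the paper's algorithm. The function names `is_fair` and `clustering_cost`, the Euclidean metric, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def is_fair(center_ids, groups, caps):
    """Return True if every group contributes at most caps[g] centers."""
    counts = Counter(groups[i] for i in center_ids)
    return all(counts[g] <= caps.get(g, 0) for g in counts)

def clustering_cost(points, center_ids):
    """k-center cost: the largest distance from a point to its nearest center."""
    return max(
        min(math.dist(p, points[c]) for c in center_ids)
        for p in points
    )

# Hypothetical toy instance: four points on a line, two groups, k = 2,
# and at most one center allowed per group.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
groups = ["a", "a", "b", "b"]
caps = {"a": 1, "b": 1}

fair_centers = [0, 2]    # one center from each group
unfair_centers = [0, 1]  # both centers from group "a"

print(is_fair(fair_centers, groups, caps), clustering_cost(points, fair_centers))
print(is_fair(unfair_centers, groups, caps), clustering_cost(points, unfair_centers))
```

In this toy instance the selection that violates the cap for group "a" also leaves the points of group "b" far from any center, but in general the fairness constraint can force a higher cost than unconstrained $k$-center would pay.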

Paper Structure

This paper contains 24 sections, 8 theorems, 3 equations, 2 figures, 2 tables, 2 algorithms.

Key Result

Lemma 1

For a minimal $\lambda$-independent center set $\Gamma\subseteq S$, if $\lambda \geq 2r^{*}$, then $|\Gamma|\leq k$.
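Definition 1 is not reproduced in this summary, so the sketch below rests on an assumption: that a $\lambda$-independent center set is a set of points that are pairwise more than $\lambda$ apart and that cover every point of $S$ within distance $\lambda$. Under that reading, the intuition behind the lemma is the standard one: two points more than $2r^{*}$ apart cannot lie in the same optimal cluster (by the triangle inequality through that cluster's center), so such a set has at most $k$ points. The helper `greedy_independent_centers` and the toy stream are hypothetical, and the one-pass greedy rule shown here is a generic construction rather than the paper's procedure.

```python
import math

def greedy_independent_centers(stream, lam):
    """One pass: keep a point only if it is more than lam from all kept points."""
    centers = []
    for p in stream:
        if all(math.dist(p, c) > lam for c in centers):
            centers.append(p)
    return centers

# Hypothetical toy stream: two tight clusters on a line. With k = 2 and
# centers at 1.0 and 11.0, the optimal radius is r* = 1.
stream = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
k, r_star = 2, 1.0

gamma = greedy_independent_centers(stream, lam=2 * r_star)
print(len(gamma), "centers kept with lambda = 2 r*")

# With a lambda well below 2 r*, the bound |Gamma| <= k no longer has to hold.
too_many = greedy_independent_centers(stream, lam=0.5)
print(len(too_many), "centers kept with lambda = 0.5")
```

The second call illustrates why the lemma needs $\lambda \geq 2r^{*}$: with a smaller threshold the kept set can grow well beyond $k$.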

Figures (2)

  • Figure 1: Empirical approximation ratio ($cost/r^*$) of our algorithms in comparison with other baselines.
  • Figure 2: Runtime on the 100G simulated dataset.

Theorems & Definitions (15)

  • Definition 1
  • Lemma 1
  • Theorem 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 5 more