Improved Streaming Algorithm for Fair $k$-Center Clustering
Longkun Guo, Zeyu Lin, Chaoqi Jia, Chao Chen
TL;DR
The paper tackles fair $k$-center clustering in the streaming model with a two-stage framework: a one-pass stage that reserves representative points using a $\lambda$-independent center set, followed by a post-streaming stage that selects the final centers from that reserved set. The algorithm achieves a 5-approximation with $O(k\log n)$ memory, and the same approach yields a 3-approximation for the offline problem. The post-streaming stage distinguishes three cases and handles the most complicated one in polynomial time by transforming it into a specially constrained vertex cover problem on an auxiliary graph, so that the per-group fairness bounds are preserved. For semi-structured streams, where the points of each group arrive in batches, the paper gives a 3-approximation for $m = 2$ groups and a 4-approximation for general $m$. Experiments on real and simulated datasets show lower clustering cost and shorter runtime than the baselines, and the 5-approximation bound is shown to be tight under sublinear memory.
Abstract
Many real-world applications pose challenges in incorporating fairness constraints into the $k$-center clustering problem, where the dataset consists of $m$ demographic groups, each with a specified upper bound on the number of centers to ensure fairness. Focusing on big data scenarios, this paper addresses the problem in a streaming setting, where data points arrive one by one sequentially in a continuous stream. Leveraging a structure called the $\lambda$-independent center set, we propose a one-pass streaming algorithm that first computes a reserved set of points during the streaming process. Then, for the post-streaming process, we propose an approach for selecting centers from the reserved point set by analyzing all three possible cases, transforming the most complicated one into a specially constrained vertex cover problem in an auxiliary graph. Our algorithm achieves a tight approximation ratio of 5 while consuming $O(k\log n)$ memory. It can also be readily adapted to solve the offline fair $k$-center problem, achieving a 3-approximation ratio that matches the current state of the art. Furthermore, we extend our approach to a semi-structured data stream, where data points from each group arrive in batches. In this setting, we present a 3-approximation algorithm for $m = 2$ and a 4-approximation algorithm for general $m$. Lastly, we conduct extensive experiments to evaluate the performance of our approaches, demonstrating that they outperform existing baselines in both clustering cost and runtime efficiency.
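To make the problem setting concrete, the following Python sketch illustrates the two quantities the abstract refers to (the $k$-center cost and the per-group upper bounds on centers) together with a simple one-pass filter. It is not the authors' algorithm. The abstract does not define the $\lambda$-independent center set, so the sketch assumes it means a set of points whose pairwise distances all exceed $\lambda$; the function names, the Euclidean metric, and the group labels are likewise hypothetical choices made for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def clustering_cost(points, centers):
    """k-center objective: largest distance from any point to its nearest center."""
    return max(min(dist(p, c) for c in centers) for p in points)


def is_fair(center_groups, upper_bounds):
    """True if every group contributes at most its allowed number of centers.

    center_groups: group label of each chosen center.
    upper_bounds: dict mapping group label -> maximum number of centers.
    """
    counts = Counter(center_groups)
    return all(counts[g] <= upper_bounds.get(g, 0) for g in counts)


def greedy_independent_set(stream, lam):
    """One pass over the stream (ASSUMED notion of lambda-independence).

    A point is kept only if it lies farther than lam from every point kept
    so far, so the kept points are pairwise more than lam apart and every
    discarded point is within lam of some kept point.
    """
    kept = []
    for p in stream:
        if all(dist(p, q) > lam for q in kept):
            kept.append(p)
    return kept


# Toy usage: two well-separated pairs of points on a line.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
reserved = greedy_independent_set(points, lam=2)
cost = clustering_cost(points, reserved)
```

In this toy run the filter retains one point from each pair, and every discarded point lies within $\lambda$ of a retained one. A filter of this kind does not by itself enforce the group bounds; in the paper that is the job of the post-streaming selection stage.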
