
A Generic Framework for Fair Consensus Clustering in Streams

Diptarka Chakraborty, Kushagra Chatterjee, Debarati Das, Tien-Long Nguyen

TL;DR

The paper addresses fair consensus clustering in data streams, introducing the first constant-factor streaming algorithms for both 1-median and k-median objectives under fairness constraints. It develops a generic two-phase framework that combines closest fair clustering with cluster fitting and uses logarithmic-sample strategies to achieve sublinear space, with a $(\gamma+1.995)$-approximation for 1-median and a $(1.0151\gamma+1.99951)$-approximation for k-median, assuming access to efficient closest-fair subroutines. The contributions include a modular, fairness-agnostic approach, near-optimal space complexity, and extensions to the streaming setting, plus concrete corollaries for common two-color fairness regimes. The results advance scalable, fair ensemble clustering in dynamic environments and have potential impact on federated, streaming analyses and real-time decision support where fairness is a concern.

Abstract

Consensus clustering seeks to combine multiple clusterings of the same dataset, potentially derived by considering various non-sensitive attributes by different agents in a multi-agent environment, into a single partitioning that best reflects the overall structure of the underlying dataset. Recent work by Chakraborty et al. introduced a fair variant under proportionate fairness and obtained a constant-factor approximation by naively selecting the best closest fair input clustering; however, their offline approach requires storing all input clusterings, which is prohibitively expensive for most large-scale applications. In this paper, we initiate the study of fair consensus clustering in the streaming model, where input clusterings arrive sequentially and memory is limited. We design the first constant-factor algorithm that processes the stream while storing only a logarithmic number of inputs. En route, we introduce a new generic algorithmic framework that integrates closest fair clustering with cluster fitting, yielding improved approximation guarantees not only in the streaming setting but also when revisited offline. Furthermore, the framework is fairness-agnostic: it applies to any fairness definition for which an approximately close fair clustering can be computed efficiently. Finally, we extend our methods to the more general k-median consensus clustering problem.
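To make the 1-median objective concrete, here is a minimal Python sketch. It assumes the distance between two clusterings is the number of point pairs on which they disagree (together in one, apart in the other), a common choice in consensus clustering that this summary does not itself confirm. The names `pair_distance`, `one_median_cost`, and `best_input` are hypothetical, and the sketch ignores fairness entirely: it only illustrates the "pick the best input" baseline, not the paper's algorithm.

```python
from itertools import combinations

def pair_distance(a, b):
    """Count point pairs that are co-clustered in exactly one of a, b.

    Each clustering is a list of cluster labels, one per point.
    ASSUMPTION: pairwise-disagreement distance; not stated in the summary.
    """
    assert len(a) == len(b), "clusterings must cover the same points"
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(len(a)), 2)
    )

def one_median_cost(candidate, inputs):
    """1-median objective: total distance from candidate to all inputs."""
    return sum(pair_distance(candidate, c) for c in inputs)

def best_input(inputs):
    """Baseline: return the input clustering with the lowest total cost."""
    return min(inputs, key=lambda c: one_median_cost(c, inputs))

# Three toy clusterings of four points (labels are arbitrary names).
inputs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
consensus = best_input(inputs)
```

Note that this baseline keeps every input in memory and evaluates all pairs, which is exactly the storage cost the streaming setting is meant to avoid.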

Paper Structure (20 sections, 23 theorems, 37 equations, 1 table, 4 algorithms)


Key Result

Theorem 1

Suppose there is a $\gamma$-approximation algorithm for closest fair clustering with running time $t_{1}(n)$. Then there is a $(\gamma+1.92)$-approximation algorithm for fair consensus clustering that runs in time $O(m^{4}n^{2} + m^{3}t_{1}(n))$.

Theorems & Definitions (28)

  • Definition 2.1: Fair Clustering
  • Definition 2.2: Closest Fair Clustering
  • Definition 2.3: $1$-median Consensus Clustering problem
  • Definition 2.4: $k$-median Consensus Clustering problem
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • ...and 18 more