A General Anchor-Based Framework for Scalable Fair Clustering

Shengfei Wei, Suyuan Liu, Jun Wang, Ke Liang, Miaomiao Li, Lei Luo

TL;DR

The paper tackles the scalability gap in fair clustering by introducing the Anchor-based Fair Clustering Framework (AFCF), which reduces computation from operating on all $n$ samples to operating on a much smaller set of $m$ anchors ($m \ll n$) and then propagates fairness to the full dataset. AFCF comprises four modules: Fair Anchor Generation (FDAS), Anchor Fair Clustering, Fair Anchor Graph Construction (with a group-label co-constraint), and Label Propagation, supported by an ADMM-based optimization that preserves demographic parity. The authors prove that the fairness of the final clustering matches that of the anchor clustering and demonstrate linear-time scalability with strong clustering performance and fairness on large-scale benchmarks. Empirical results show orders-of-magnitude speedups across multiple baselines and datasets, with preserved or improved accuracy and fairness metrics, making large-scale fair clustering practically feasible.

Abstract

Fair clustering is crucial for mitigating bias in unsupervised learning, yet existing algorithms often suffer from quadratic or super-quadratic computational complexity, rendering them impractical for large-scale datasets. To bridge this gap, we introduce the Anchor-based Fair Clustering Framework (AFCF), a novel, general, and plug-and-play framework that empowers arbitrary fair clustering algorithms with linear-time scalability. Our approach first selects a small but representative set of anchors using a novel fair sampling strategy. Then, any off-the-shelf fair clustering algorithm can be applied to this small anchor set. The core of our framework lies in a novel anchor graph construction module, where we formulate an optimization problem to propagate labels while preserving fairness. This is achieved through a carefully designed group-label joint constraint, which we prove theoretically ensures that the fairness of the final clustering on the entire dataset matches that of the anchor clustering. We solve this optimization efficiently using an ADMM-based algorithm. Extensive experiments on multiple large-scale benchmarks demonstrate that AFCF drastically accelerates state-of-the-art methods, reducing computational time by orders of magnitude while maintaining strong clustering performance and fairness guarantees.
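
The anchor-selection step mentioned in the abstract can be illustrated with a simple stand-in. The sketch below is not the paper's fair sampling strategy, whose details are not given in this summary; it is plain stratified sampling that picks $m$ anchors so that each protected group keeps roughly its share of the data. The function name `proportional_anchor_sample` and the toy group labels are hypothetical.

```python
import random
from collections import defaultdict


def proportional_anchor_sample(groups, m, seed=0):
    """Pick m anchor indices so each group keeps about its share of the data.

    Assumption: plain stratified sampling with largest-remainder rounding,
    used here only as a stand-in for the paper's fair anchor generation.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    n = len(groups)
    # Ideal (fractional) number of anchors per group.
    quotas = {g: m * len(ids) / n for g, ids in by_group.items()}
    alloc = {g: int(q) for g, q in quotas.items()}

    # Hand out the remaining anchors to the largest fractional remainders.
    leftover = m - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1

    anchors = []
    for g, ids in by_group.items():
        anchors.extend(rng.sample(ids, alloc[g]))
    return sorted(anchors)


# Toy data: 70% of the points belong to group "a", 30% to group "b".
groups = ["a"] * 70 + ["b"] * 30
anchors = proportional_anchor_sample(groups, m=10)
```

With these toy inputs the sampler returns 10 distinct indices, 7 from group "a" and 3 from group "b", so the anchor set mirrors the 70/30 split of the full data.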

Paper Structure

This paper contains 32 sections, 2 theorems, 16 equations, 7 figures, 5 tables, 3 algorithms.

Key Result

Proposition 1

Under the fairness constraints of the optimization problem (eq:optimization), the balance metric of the final clustering, $\text{balance}(\mathcal{C})$, equals that of the anchor clustering, $\text{balance}(\mathcal{C}_{\text{a}})$:

$$\text{balance}(\mathcal{C}) = \text{balance}(\mathcal{C}_{\text{a}})$$
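
To make the statement concrete, here is a minimal sketch of a balance computation. The paper's exact definition of balance is not shown in this summary, so the code assumes a definition that is common in the fair clustering literature: for each cluster, the ratio of the smallest to the largest protected-group count, minimized over clusters. The function name and the toy data (which reuse the anchor counts from the Figure 1 caption) are illustrative only.

```python
from collections import Counter


def balance(labels, groups):
    """Worst-case group ratio over clusters (assumed definition).

    For each cluster, divide the size of its smallest protected group by
    the size of its largest one, then take the minimum over clusters.
    A value of 0 means some cluster is missing a group entirely.
    """
    all_groups = set(groups)
    per_cluster = {}
    for c, g in zip(labels, groups):
        per_cluster.setdefault(c, Counter())[g] += 1

    worst = 1.0
    for counts in per_cluster.values():
        sizes = [counts.get(g, 0) for g in all_groups]
        worst = min(worst, min(sizes) / max(sizes))
    return worst


# Anchor clustering with the counts from the Figure 1 caption:
# cluster 0 has 3 circles + 1 square, cluster 1 has 2 circles + 1 square.
anchor_labels = [0, 0, 0, 0, 1, 1, 1]
anchor_groups = ["circle"] * 3 + ["square"] + ["circle"] * 2 + ["square"]

# Hypothetical full dataset whose clusters keep the same group proportions.
full_labels = [0] * 8 + [1] * 6
full_groups = ["circle"] * 6 + ["square"] * 2 + ["circle"] * 4 + ["square"] * 2

anchor_balance = balance(anchor_labels, anchor_groups)
full_balance = balance(full_labels, full_groups)
```

In this toy case both clusterings have balance 1/3, set by the cluster with three circles per square. This is the kind of equality the proposition asserts, under the assumed definition.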

Figures (7)

  • Figure 1: Conceptual framework of the proposed fair anchor-based clustering. Anchors are selected proportionally to cluster demographics, with the left yielding 3 circle and 1 square anchors and the right yielding 2 circle and 1 square anchors. This enables efficient fair clustering on anchor points only, where $m \ll n$. Group-label joint constraints in the anchor graph $\mathbf{Z}$ maintain demographic proportions with sums of 3/7 for blue circles, 1/7 for blue squares, 2/7 for orange circles, and 1/7 for orange squares. These fairness properties propagate to final clusters through $\mathbf{Y} = \mathbf{Z}^\top\mathbf{L}$, preserving the original demographic ratios. The approach integrates proportional representation, plug-and-play algorithmic flexibility, and constrained graph optimization for fairness preservation.
  • Figure 2: NMI and Balance on Law School and Bank data sets w.r.t. different values of $\alpha$.
  • Figure 3: NMI and Balance on Law School and Bank data sets w.r.t. different values of $m$.
  • Figure 4: The convergence of the proposed algorithm for minimizing the objective in (eq:augmented_lag). The plots are based on the Bank and Law School datasets.
  • Figure 5: The convergence of the proposed algorithm for minimizing the objective. The plots are based on the Credit, Zafar and Census II datasets.
  • ...and 2 more figures
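
The propagation step named in the Figure 1 caption, $\mathbf{Y} = \mathbf{Z}^\top\mathbf{L}$, can be sketched in a few lines. The shapes are assumptions, since this summary does not state them: $\mathbf{Z}$ is taken to be an $m \times n$ anchor graph whose columns sum to 1, and $\mathbf{L}$ an $m \times k$ one-hot matrix of anchor labels. The sketch covers only the matrix product and a hard assignment by argmax; it does not implement the group-label joint constraint or the ADMM solver that produce $\mathbf{Z}$ in the paper.

```python
import numpy as np


def propagate_labels(Z, L):
    """Propagate anchor labels to all samples via Y = Z^T L.

    Z: (m, n) anchor graph, Z[j, i] = weight of anchor j for sample i
       (assumed shape; columns are assumed to sum to 1).
    L: (m, k) one-hot anchor cluster labels.
    Returns the soft assignment Y with shape (n, k) and hard labels (n,).
    """
    Y = Z.T @ L
    return Y, Y.argmax(axis=1)


# Toy example: m = 3 anchors, n = 5 samples, k = 2 clusters.
L = np.array([[1, 0],
              [1, 0],
              [0, 1]], dtype=float)
Z = np.array([[0.7, 0.2, 0.1, 0.0, 0.5],
              [0.3, 0.6, 0.1, 0.2, 0.1],
              [0.0, 0.2, 0.8, 0.8, 0.4]])

Y, hard_labels = propagate_labels(Z, L)
```

Because each column of the toy `Z` sums to 1 and each row of `L` is one-hot, every row of `Y` is a distribution over clusters, and `hard_labels` picks the cluster with the largest propagated weight.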

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • proof