Local Search-based Individually Fair Clustering with Outliers

Binita Maity, Shrutimoy Das, Anirban Dasgupta

TL;DR

This work tackles individually fair clustering in the presence of outliers by first discarding a subset of fairness-based outliers and then applying a constrained local-search refinement. It introduces BaseCent, a seeding procedure that identifies anchor zones and fairness-based outliers, followed by LSFO, a constrained local-search procedure that maintains anchor-zone feasibility while iteratively improving the $k$-means objective. The authors bound the number of outliers discarded and prove an $O(1)$-approximation guarantee when the relevant parameter is held fixed. Experiments on real-world datasets show improved clustering cost and competitive fairness violations. The approach scales to large datasets and is robust to outliers, which makes it practical for fair clustering on noisy data.

Abstract

In this paper, we present a local search-based algorithm for individually fair clustering in the presence of outliers. We consider the individual fairness definition proposed in Jung et al., which requires that each of the $n$ points in the dataset have one of the $k$ centers within its $n/k$ nearest neighbors. However, if the dataset is known to contain outliers, the set of fair centers obtained under this definition might be suboptimal for the non-outlier points. To address this issue, we propose a method that discards a set of points marked as outliers and computes the set of fair centers for the remaining non-outlier points. Our method uses a randomized variant of local search, which makes it scalable to large datasets. We also provide an approximation guarantee for our method, as well as a bound on the number of outliers discarded. Additionally, we support our claims experimentally on a set of real-world datasets.
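The fairness requirement described in the abstract can be illustrated with a small sketch. It assumes the convention that a point counts among its own nearest neighbors, so the fair radius is the distance to the $\lceil n/k \rceil$-th closest point including itself; the function names `fair_radii` and `unfair_points` are hypothetical, and the multiplicative relaxation `gamma` is an assumption about how a relaxed version of the constraint might be checked, not the paper's implementation.

```python
import math
import numpy as np

def fair_radii(points: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its ceil(n/k)-th nearest neighbour.

    Assumption: the point itself is counted as its first neighbour.
    Uses a dense O(n^2) distance matrix, so it only suits small inputs.
    """
    n = points.shape[0]
    t = math.ceil(n / k)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting a row, column 0 is the point itself (distance 0).
    return np.sort(dists, axis=1)[:, t - 1]

def unfair_points(points: np.ndarray, centers: np.ndarray,
                  k: int, gamma: float = 1.0) -> np.ndarray:
    """Indices of points whose nearest center lies beyond gamma * fair radius."""
    radii = fair_radii(points, k)
    diffs = points[:, None, :] - centers[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    # Small tolerance so that exact ties count as fair.
    return np.flatnonzero(nearest > gamma * radii + 1e-12)

if __name__ == "__main__":
    pts = np.array([[0.0], [1.0], [2.0], [10.0]])
    print(fair_radii(pts, k=2))
    print(unfair_points(pts, np.array([[0.0], [1.0]]), k=2))
```

In the toy data, the isolated point at 10 has a much larger fair radius than the other three, which is the kind of point that can pull a fair center away from the bulk of the data.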

Paper Structure

This paper contains 17 sections, 8 theorems, 17 equations, 5 tables, 3 algorithms.

Key Result

Lemma 1

Suppose $n-m$ points are covered by $k+r$ $\gamma$-anchor zones $(0 \leq r \leq m)$, with the $m$ points having the largest fair radii being discarded as fairness-based outliers. Then, there exists a set of $k$ $\gamma'$-anchor zones that covers $n-2m$ points, with $\gamma' = \gamma+2$.
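To make the quantities in Lemma 1 concrete, here is a purely illustrative instantiation. The numbers are hypothetical and chosen only to satisfy $0 \leq r \leq m$; they do not come from the paper's experiments.

```latex
% Hypothetical values: n = 1000, k = 10, m = 50, r = 5, \gamma = 3
\begin{align*}
\text{Hypothesis:}\quad & n - m = 950 \text{ points covered by } k + r = 15 \text{ anchor zones with parameter } \gamma = 3,\\
\text{Conclusion:}\quad & n - 2m = 900 \text{ points covered by } k = 10 \text{ anchor zones with parameter } \gamma' = \gamma + 2 = 5.
\end{align*}
```

Read this way, the lemma trades the $r$ extra zones for at most $m$ additional uncovered points and an additive increase of 2 in the zone parameter.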

Theorems & Definitions (18)

  • Definition 1: Fair radius $\delta(\cdot)$
  • Definition 2: Fair $k$-means clustering
  • Definition 3: $(\gamma, k, m)$-fair means clustering excluding outliers
  • Lemma 1
  • proof
  • Theorem 2
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 8 more
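
The idea of discarding the $m$ points with the largest fair radii and clustering the rest can be sketched as follows. This is a minimal illustration, not the paper's BaseCent or LSFO procedures: the helper names `split_fairness_outliers` and `kmeans_cost` are hypothetical, the fair radii are taken as a precomputed input, and measuring the $k$-means cost only over the retained points is an assumption about how an outlier-excluding objective would be evaluated.

```python
import numpy as np

def split_fairness_outliers(radii: np.ndarray, m: int):
    """Split indices into inliers and the m points with the largest fair radii."""
    n = radii.shape[0]
    order = np.argsort(radii, kind="stable")  # ascending by fair radius
    inliers = np.sort(order[: n - m])
    outliers = np.sort(order[n - m:])
    return inliers, outliers

def kmeans_cost(points: np.ndarray, centers: np.ndarray, keep: np.ndarray) -> float:
    """Sum of squared distances from each retained point to its nearest center."""
    kept = points[keep]
    diffs = kept[:, None, :] - centers[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())

if __name__ == "__main__":
    pts = np.array([[0.0], [1.0], [2.0], [10.0]])
    radii = np.array([1.0, 1.0, 1.0, 8.0])  # illustrative fair radii
    inliers, outliers = split_fairness_outliers(radii, m=1)
    print(outliers, kmeans_cost(pts, np.array([[1.0]]), inliers))
```

Taking the radii as an argument keeps the outlier rule separate from whichever nearest-neighbor routine is used to compute them.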