Clustering What Matters in Constrained Settings
Ragesh Jaiswal, Amit Kumar
TL;DR
The paper addresses outlier-constrained clustering for $k$-median and $k$-means by giving an approximation-preserving reduction to the corresponding outlier-free problems. The reduction is based on $D^z$-sampling: it creates $q=f(k,m,\varepsilon)=\left( \frac{k+m}{\varepsilon}\right)^{O(m)}$ outlier-free instances and combines their solutions via a randomized selection with high-probability guarantees. Given an $\alpha$-approximation algorithm for the outlier-free problem, the approach yields an $\alpha(1+\varepsilon)$-approximation for the outlier version, with a running time that scales with $q$ and with the cost of the underlying subroutines. The work extends to labelled constrained clustering and general metric spaces, giving the first FPT approximation guarantees for several constrained outlier problems and broadening the applicability of reduction-based techniques in constrained clustering. This enables near-optimal solutions in settings with hard constraints (e.g., capacitated, fair, coloured) and a limited number of outliers, with implications for robust clustering in diverse data regimes.
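As a rough illustration of the sampling primitive named above, the following minimal sketch draws one point with probability proportional to the $z$-th power of its distance to the nearest current center ($z=1$ for $k$-median, $z=2$ for $k$-means). It assumes Euclidean points stored as tuples; the function name `dz_sample` and its parameters are hypothetical, and the sketch shows only the sampling step, not the paper's full reduction.

```python
import math
import random


def dz_sample(points, centers, z=2, rng=random):
    """Draw one point with probability proportional to dist(point, centers)**z.

    Hypothetical helper for illustration only; points and centers are
    sequences of equal-length coordinate tuples (Euclidean distance assumed).
    """
    # With no centers yet, fall back to a uniform draw.
    if not centers:
        return rng.choice(points)

    # Weight of each point: distance to its nearest center, raised to z.
    weights = [min(math.dist(p, c) for c in centers) ** z for p in points]

    # If every point coincides with a center, all weights are zero;
    # fall back to a uniform draw to avoid an invalid distribution.
    if sum(weights) == 0:
        return rng.choice(points)

    return rng.choices(points, weights=weights, k=1)[0]
```

For example, `dz_sample([(0, 0), (0, 0), (10, 0)], [(0, 0)])` can only return `(10, 0)`, since the other two points have zero weight. Far-away points, which include potential outliers, are therefore the ones most likely to be drawn.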
Abstract
Constrained clustering problems generalize classical clustering formulations, e.g., $k$-median, $k$-means, by imposing additional constraints on the feasibility of clustering. There has been significant recent progress in obtaining approximation algorithms for these problems, both in the metric and the Euclidean settings. However, the outlier version of these problems, where the solution is allowed to leave out $m$ points from the clustering, is not well understood. In this work, we give a general framework for reducing the outlier version of a constrained $k$-median or $k$-means problem to the corresponding outlier-free version with only $(1+\varepsilon)$-loss in the approximation ratio. The reduction is obtained by mapping the original instance of the problem to $f(k,m, \varepsilon)$ instances of the outlier-free version, where $f(k, m, \varepsilon) = \left( \frac{k+m}{\varepsilon}\right)^{O(m)}$. As specific applications, we get the following results:
- First FPT (in the parameters $k$ and $m$) $(1+\varepsilon)$-approximation algorithm for the outlier version of capacitated $k$-median and $k$-means in Euclidean spaces with hard capacities.
- First FPT (in the parameters $k$ and $m$) $(3+\varepsilon)$- and $(9+\varepsilon)$-approximation algorithms for the outlier version of capacitated $k$-median and $k$-means, respectively, in general metric spaces with hard capacities.
- First FPT (in the parameters $k$ and $m$) $(2-\delta)$-approximation algorithm for the outlier version of the $k$-median problem under the Ulam metric.
Our work generalizes the known results to a larger class of constrained clustering problems. Further, our reduction works for arbitrary metric spaces and so can extend clustering algorithms for outlier-free versions in both Euclidean and arbitrary metric spaces.
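To make the outlier objective concrete, here is a minimal sketch of the cost of a fixed set of centers when up to $m$ points may be left out. It covers only the unconstrained Euclidean case, where each kept point is served by its nearest center; with capacities or other constraints the assignment is no longer nearest-center, so this is not the constrained objective. The name `outlier_cost` and its parameters are hypothetical.

```python
import math


def outlier_cost(points, centers, m, z=1):
    """Cost of `centers` on `points` after discarding the m costliest points.

    Hypothetical helper for illustration: z=1 gives the k-median cost,
    z=2 the k-means cost. Unconstrained case only (nearest-center
    assignment, Euclidean distance).
    """
    # Per-point cost: distance to the nearest center, raised to z.
    costs = sorted(min(math.dist(p, c) for c in centers) ** z for p in points)

    # Keep the n - m cheapest points; the m most expensive are the outliers.
    keep = max(len(costs) - m, 0)
    return sum(costs[:keep])
```

With points `[(0, 0), (1, 0), (100, 0)]`, a single center at `(0, 0)`, and `m=1`, the far point is discarded and only the distances 0 and 1 contribute to the $k$-median cost.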
