On Approximability of $\ell_2^2$ Min-Sum Clustering
Karthik C. S., Euiwoong Lee, Yuval Rabani, Chris Schwiegelshohn, Samson Zhou
TL;DR
This work studies the approximability of the $\ell_2^2$ min-sum $k$-clustering objective. It proves that the objective is NP-hard to approximate within a factor better than $1.056$ and, assuming a balanced variant of the Johnson Coverage Hypothesis, within a factor better than $1.327$. Complementing these hardness results, it gives a nearly linear-time parameterized PTAS based on $D^2$ sampling that runs in time $O\left(n^{1+o(1)}d\cdot \exp\left((k\varepsilon^{-1})^{O(1)}\right)\right)$, where $d$ is the dimension of the input, making near-optimal clustering feasible for moderate $k$ and $\varepsilon^{-1}$. The paper also considers a learning-augmented setting with a label oracle whose adversarial error rate is $\alpha\in[0,1/2)$, and gives a polynomial-time $\frac{1+\gamma\alpha}{(1-\alpha)^2}$-approximation for a fixed constant $\gamma>0$. Together, the results give the first hardness-of-approximation bounds for Euclidean min-sum clustering, a scalable approximation scheme, and an algorithm that exploits learned labels.
Abstract
The $\ell_2^2$ min-sum $k$-clustering problem is to partition an input set into clusters $C_1,\ldots,C_k$ to minimize $\sum_{i=1}^k\sum_{p,q\in C_i}\|p-q\|_2^2$. Although $\ell_2^2$ min-sum $k$-clustering is NP-hard, it was not previously known whether it is NP-hard to approximate beyond a certain factor. In this paper, we give the first hardness-of-approximation result for the $\ell_2^2$ min-sum $k$-clustering problem. We show that it is NP-hard to approximate the objective to a factor better than $1.056$ and moreover, assuming a balanced variant of the Johnson Coverage Hypothesis, it is NP-hard to approximate the objective to a factor better than $1.327$. We then complement our hardness result by giving a nearly linear time parameterized PTAS for $\ell_2^2$ min-sum $k$-clustering running in time $O\left(n^{1+o(1)}d\cdot \exp((k\cdot\varepsilon^{-1})^{O(1)})\right)$, where $d$ is the underlying dimension of the input dataset. Finally, we consider a learning-augmented setting, where the algorithm has access to an oracle that outputs a label $i\in[k]$ for each input point, thereby implicitly partitioning the input dataset into $k$ clusters that induce an approximately optimal solution, up to some amount of adversarial error $\alpha\in\left[0,\frac{1}{2}\right)$. We give a polynomial-time algorithm that outputs a $\frac{1+\gamma\alpha}{(1-\alpha)^2}$-approximation to $\ell_2^2$ min-sum $k$-clustering, for a fixed constant $\gamma>0$.
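To make the objective concrete, the following minimal sketch evaluates $\sum_{i=1}^k\sum_{p,q\in C_i}\|p-q\|_2^2$ for a given partition. It is an illustration only, not an algorithm from the paper. The function names and the toy data are hypothetical, and the sketch assumes the inner sum ranges over ordered pairs; counting unordered pairs instead would halve the value without changing any approximation ratio. It also checks the standard identity $\sum_{p,q\in C}\|p-q\|_2^2 = 2|C|\sum_{p\in C}\|p-\mu_C\|_2^2$, where $\mu_C$ is the centroid of $C$, which relates the min-sum cost of a cluster to its size-weighted $k$-means cost.

```python
from itertools import product


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def min_sum_cost(clusters):
    """Direct evaluation: sum over clusters of all ordered pairs (p, q).

    Assumption: pairs are ordered, so each unordered pair is counted twice.
    """
    return sum(sq_dist(p, q) for c in clusters for p, q in product(c, repeat=2))


def centroid_form_cost(clusters):
    """Same quantity via the identity 2 * |C| * sum_p ||p - mean(C)||^2."""
    total = 0.0
    for c in clusters:
        if not c:
            continue  # an empty cluster contributes nothing
        dim = len(c[0])
        mean = [sum(p[j] for p in c) / len(c) for j in range(dim)]
        total += 2 * len(c) * sum(sq_dist(p, mean) for p in c)
    return total


# Hypothetical toy partition of five points in the plane into k = 2 clusters.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(5.0, 5.0), (5.0, 7.0), (8.0, 5.0)],
]

print(min_sum_cost(clusters))  # 60.0
assert abs(min_sum_cost(clusters) - centroid_form_cost(clusters)) < 1e-9
```

The direct evaluation takes time quadratic in the cluster sizes, while the centroid form is linear. The factor $|C|$ in the identity is what distinguishes the min-sum objective from $k$-means: large clusters are penalized more heavily.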
