On Approximability of $\ell_2^2$ Min-Sum Clustering

Karthik C. S., Euiwoong Lee, Yuval Rabani, Chris Schwiegelshohn, Samson Zhou

TL;DR

This work establishes the approximability landscape for the $\ell_2^2$ min-sum $k$-clustering objective. It proves that approximating the objective within a factor better than $1.056$ is NP-hard and, under a dense and balanced variant of the Johnson Coverage Hypothesis, that approximating it within a factor better than $1.327$ is NP-hard. Complementing these hardness results, it introduces a nearly linear-time parameterized PTAS based on $D^2$ sampling that runs in time $O\left(n^{1+o(1)}d\cdot \exp\left((k\varepsilon^{-1})^{O(1)}\right)\right)$, making near-optimal clustering feasible for moderate $k$ and $\varepsilon$. The paper also covers a learning-augmented setting with a label oracle that errs adversarially at rate $\alpha\in[0,1/2)$, yielding a polynomial-time $(1+\gamma\alpha)/(1-\alpha)^2$-approximation for a fixed constant $\gamma>0$. Overall, the results advance understanding of the Euclidean min-sum clustering problem by providing explicit hardness thresholds, a scalable algorithm, and a framework for learning-augmented clustering.
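To make the $D^2$ sampling step concrete, here is a minimal sketch of the standard procedure (the one used for $k$-means++ seeding): each new point is drawn with probability proportional to its squared distance from the nearest point already chosen. This summary does not describe how the paper's PTAS uses its samples, so the function name `d2_sample`, the `seed` parameter, and the uniform fallback are illustrative assumptions rather than the authors' algorithm.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample(points, k, seed=0):
    """Pick k points by generic D^2 sampling (hypothetical helper).

    The first point is uniform; each later point is drawn with probability
    proportional to its squared distance to the closest point chosen so far.
    """
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    # dist2[i] = squared distance from points[i] to its nearest chosen point
    dist2 = [sq_dist(p, chosen[0]) for p in points]
    while len(chosen) < k:
        if sum(dist2) == 0:
            # Every point coincides with a chosen one: fall back to uniform.
            nxt = rng.choice(points)
        else:
            nxt = rng.choices(points, weights=dist2, k=1)[0]
        chosen.append(nxt)
        dist2 = [min(d, sq_dist(p, nxt)) for d, p in zip(dist2, points)]
    return chosen


points = [(0.0, 0.0), (0.0, 1.0), (100.0, 100.0), (100.0, 101.0)]
centers = d2_sample(points, k=2)
```

Because an already chosen point has weight zero, the second draw in this toy input always lands on a different point, and it is far more likely to come from the opposite group than from the same one.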

Abstract

The $\ell_2^2$ min-sum $k$-clustering problem is to partition an input set into clusters $C_1,\ldots,C_k$ to minimize $\sum_{i=1}^k\sum_{p,q\in C_i}\|p-q\|_2^2$. Although $\ell_2^2$ min-sum $k$-clustering is NP-hard, it is not known whether it is NP-hard to approximate $\ell_2^2$ min-sum $k$-clustering beyond a certain factor. In this paper, we give the first hardness-of-approximation result for the $\ell_2^2$ min-sum $k$-clustering problem. We show that it is NP-hard to approximate the objective to a factor better than $1.056$ and moreover, assuming a balanced variant of the Johnson Coverage Hypothesis, it is NP-hard to approximate the objective to a factor better than $1.327$. We then complement our hardness result by giving a nearly linear time parameterized PTAS for $\ell_2^2$ min-sum $k$-clustering running in time $O\left(n^{1+o(1)}d\cdot \exp((k\cdot\varepsilon^{-1})^{O(1)})\right)$, where $d$ is the underlying dimension of the input dataset. Finally, we consider a learning-augmented setting, where the algorithm has access to an oracle that outputs a label $i\in[k]$ for each input point, thereby implicitly partitioning the input dataset into $k$ clusters that induce an approximately optimal solution, up to some amount of adversarial error $\alpha\in\left[0,\frac{1}{2}\right)$. We give a polynomial-time algorithm that outputs a $\frac{1+\gamma\alpha}{(1-\alpha)^2}$-approximation to $\ell_2^2$ min-sum $k$-clustering, for a fixed constant $\gamma>0$.
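The objective in the abstract can be evaluated directly, as in the small sketch below. It assumes the inner sum ranges over ordered pairs $(p,q)$; summing over unordered pairs would halve every value without changing which partition is optimal. The second function relies on the standard identity $\sum_{p,q\in C}\|p-q\|_2^2 = 2|C|\sum_{p\in C}\|p-\mu_C\|_2^2$, where $\mu_C$ is the centroid of $C$. The function names are illustrative and do not come from the paper.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def min_sum_cost(clusters):
    """Sum of squared distances over all ordered pairs within each cluster."""
    return sum(
        sq_dist(p, q)
        for cluster in clusters
        for p in cluster
        for q in cluster
    )


def min_sum_cost_via_centroids(clusters):
    """Same cost via 2 * |C| * (sum of squared distances to the centroid)."""
    total = 0.0
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster contributes nothing
        m = len(cluster)
        dim = len(cluster[0])
        centroid = [sum(p[j] for p in cluster) / m for j in range(dim)]
        total += 2 * m * sum(sq_dist(p, centroid) for p in cluster)
    return total


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (10.0, 2.0)]]
pairwise = min_sum_cost(clusters)
centroid_form = min_sum_cost_via_centroids(clusters)
```

The centroid form takes time linear in the cluster size rather than quadratic, and the factor $|C|$ shows how this objective weights each cluster's $k$-means cost by its size.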

Paper Structure

This paper contains 31 sections, 33 theorems, 95 equations, 5 figures, 2 algorithms.

Key Result

Theorem 1.3

It is NP-hard to approximate $\ell_2^2$ min-sum $k$-clustering to a factor better than $1.056$. Moreover, assuming the Dense and Balanced Johnson Coverage Hypothesis ($\mathsf{Balanced-JCH}^*$), $\ell_2^2$ min-sum $k$-clustering is NP-hard to approximate to a factor better than $1.327$.

Figures (5)

  • Figure 1: Clustering of the input dataset in panel (a) with $k=2$. Panel (b) is an optimal centroid-based clustering, e.g., $k$-median or $k$-means, while the more natural clustering in panel (c) is an optimal density-based clustering, e.g., $\ell_2$ min-sum $k$-clustering.
  • Figure 2: Note that with arbitrarily small error rate, i.e., $\frac{1}{n}$, a single mislabeled point among the $n$ input points causes the resulting clustering to be arbitrarily bad for $\Delta\gg n^2\cdot R$.
  • Figure 3: Examples of input instances of the Johnson Coverage Hypothesis for $k=2$. The first panel shows an example of a completeness instance of $\left(0.7,2,1\right)$, since all subsets of size $2$, i.e., all edges, can be covered by $k=2$ choices of subset of size $1$, i.e., two vertices. The second panel shows an example of a completeness instance of $\left(0.7,3,1\right)$, since all subsets of size $3$ can be covered by $k=2$ vertices. The third panel shows an example of a soundness instance of $\left(0.7,3,2\right)$, since at most $2\le 0.7\cdot 4$ subsets of size $3$ can be covered by any choice of $k=2$ edges.
  • Figure 4: Constrained min-cost flow problem
  • Figure 5: Example of transformation of capacitated min-cost flow problem into uncapacitated min-cost flow problem.
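
The coverage notion used in the Figure 3 caption can be checked by brute force on tiny instances. The sketch below assumes that a chosen subset $S$ of size $y$ covers a set $T$ of the collection when $S\subseteq T$, which is how the caption uses the term. The function `max_coverage` and both toy instances are hypothetical illustrations and are not the instances drawn in the figure.

```python
from itertools import combinations


def max_coverage(collection, k, y):
    """Largest number of sets in `collection` covered by k subsets of size y.

    A chosen subset S covers a set T when S is contained in T.
    Exhaustive search, so only suitable for very small instances.
    """
    collection = [frozenset(t) for t in collection]
    universe = sorted(set().union(*collection))
    candidates = [frozenset(c) for c in combinations(universe, y)]
    best = 0
    for choice in combinations(candidates, k):
        covered = sum(1 for t in collection if any(s <= t for s in choice))
        best = max(best, covered)
    return best


# Completeness-style toy instance: every edge is covered by two vertices.
edges = [{1, 2}, {1, 3}, {2, 4}]
edges_covered = max_coverage(edges, k=2, y=1)

# Soundness-style toy instance: no pair of elements lies in two triples,
# so any two chosen edges cover at most 2 of the 4 triples (2 <= 0.7 * 4).
triples = [{1, 2, 3}, {1, 4, 5}, {2, 4, 6}, {3, 5, 6}]
triples_covered = max_coverage(triples, k=2, y=2)
```

In the first instance vertices 1 and 2 cover all three edges; in the second, each edge lies in at most one triple, which is what caps the coverage at two.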

Theorems & Definitions (57)

  • Theorem 1.3: Hardness of approximation of $\ell_2^2$ min-sum $k$-clustering
  • Theorem 1.4
  • Theorem 1.5
  • Definition 2.1: Johnson Coverage Problem
  • Theorem 2.5: Cohen-AddadSL22
  • Theorem 2.6
  • Theorem 2.7
  • Claim 2.8: Claim 3.18 in Cohen-AddadSL22
  • Definition 2.9
  • Theorem 2.10
  • ...and 47 more