Provable Imbalanced Point Clustering
David Denisov, Dan Feldman, Shlomi Dolev, Michael Segal
TL;DR
This work addresses imbalanced clustering for the $k$-center problem in $\mathbb{R}^d$ without requiring labels. It introduces a relaxed mean-cluster loss $\tilde{\ell}$ and develops a coreset-based compression framework with provable guarantees, including an $\alpha$-approximation with $\alpha=2\log^2(1+n)$ and an $\varepsilon$-coreset of size $|C|=O\left(\frac{kd^3\log(k)\log^4(n)}{\varepsilon^2}\left(\log(\log(k)\log(n)) + \log\left(\frac{1}{\delta}\right)\right)\right)$ with runtime $O(ndk\log(1/\delta))$. The paper further presents bi-criteria approximations and a coreset construction, and proposes choice clustering, which selects the best clustering among multiple candidates. Empirically, the approach is validated on image quantization tasks and on synthetic and real-world datasets, where it shows competitive performance and compression advantages while operating without labeled data. Limitations include the current restriction to Euclidean distance and to weighted-coreset outputs; future work aims to generalize to other metrics and to robust estimators.
Abstract
We suggest efficient and provable methods to compute an approximation for imbalanced point clustering, that is, fitting $k$-centers to a set of points in $\mathbb{R}^d$, for any $d,k\geq 1$. To this end, we utilize \emph{coresets}, which, in the context of this paper, are essentially weighted sets of points in $\mathbb{R}^d$ that approximate the fitting loss for every model in a given set, up to a multiplicative factor of $1\pm\varepsilon$. We provide experiments (Section 3 and Section E in the appendix) that show the empirical contribution of our suggested methods on real images (novel and reference), synthetic data, and real-world data. We also propose choice clustering, which combines several clustering algorithms and yields better performance than each one separately.
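
The idea of choosing among candidate clusterings on a weighted point set can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the relaxed loss $\tilde{\ell}$ is not defined in this section, so a weighted sum of squared distances to the nearest center stands in for it, and the names `weighted_loss` and `choose_clustering` are hypothetical. The weights play the role of coreset weights; here they are all set to 1.

```python
import numpy as np


def weighted_loss(points, weights, centers):
    """Placeholder loss (an assumption, not the paper's relaxed loss):
    weighted sum of squared Euclidean distances to the nearest center."""
    # Pairwise squared distances between points and centers, shape (n, k).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((weights * d2.min(axis=1)).sum())


def choose_clustering(points, weights, candidates, loss=weighted_loss):
    """Evaluate every candidate set of centers and return the one with
    the smallest loss, as (index, centers, loss value)."""
    losses = [loss(points, weights, c) for c in candidates]
    best = int(np.argmin(losses))
    return best, candidates[best], losses[best]


# Toy weighted point set: two well-separated groups in the plane.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
weights = np.ones(len(points))

# Two candidate clusterings, e.g. the outputs of two different algorithms.
candidates = [
    np.array([[0.0, 0.5], [10.0, 0.5]]),  # one center per group
    np.array([[5.0, 0.5], [5.0, 0.5]]),   # both centers in the middle
]

best_index, best_centers, best_loss = choose_clustering(points, weights, candidates)
print(best_index, best_loss)
```

Because the loss is passed in as a parameter, any other fitting loss can be substituted without changing the selection step. Evaluating the candidates on a small weighted set instead of the full data is the kind of saving a coreset is meant to provide.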
