OneBatchPAM: A Fast and Frugal K-Medoids Algorithm
Antoine de Mathelin, Nicolas Enrique Cecchi, François Deheeger, Mathilde Mougeot, Nicolas Vayatis
TL;DR
Problem: scalable k-medoids clustering on large datasets with generic dissimilarities. Approach: OneBatchPAM performs a PAM-like local search that uses a single batch of size $m = O(\log n)$ to estimate swap gains, achieving $O((p+T)n\log n)$ time and $O(n\log n)$ memory while making the same swap decisions as FasterPAM with high probability. Contributions: a theoretical guarantee on the batch size, substantial runtime reductions over FasterPAM and BanditPAM++, competitive objective values on real datasets, and an exploration of sampling variants. Findings: OneBatchPAM delivers clustering quality similar to that of state-of-the-art methods at dramatically lower running times, especially on large-scale datasets, with a manageable memory footprint compared to $O(n^2)$ baselines. Significance: enables practical k-medoids clustering for big data with generic dissimilarities and opens avenues for further efficiency gains through coresets or adaptive batching.
Abstract
This paper proposes a novel k-medoids approximation algorithm that handles large-scale datasets with reasonable computational time and memory complexity. We develop a local-search algorithm that iteratively improves the medoid selection based on an estimate of the k-medoids objective. A single batch of size $m \ll n$ provides the estimate, which reduces the memory requirement and the number of pairwise dissimilarity computations to $O(mn)$, instead of the $O(n^2)$ required by most k-medoids baselines. We obtain theoretical results showing that a batch of size $m = O(\log n)$ is sufficient to guarantee, with high probability, the same performance as the original local-search algorithm. Multiple experiments conducted on real datasets of various sizes and dimensions show that our algorithm performs similarly to state-of-the-art methods such as FasterPAM and BanditPAM++ with a drastically reduced running time.
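The idea of estimating the objective from a single batch can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain best-swap-per-position search rather than the FasterPAM-style updates, and the function name `one_batch_swap_search`, the L1 dissimilarity, the synthetic data, and the constant 10 in the batch size are all assumptions made for illustration. Only the structure follows the abstract: an $n \times m$ dissimilarity matrix is computed once, and every swap is scored against that batch alone.

```python
import numpy as np

def one_batch_swap_search(D, k, max_sweeps=50, seed=0):
    """Swap-based local search scored on a single batch (illustrative sketch).

    D is an (n, m) array with D[i, j] = dissimilarity between point i
    and the j-th batch point. Returns medoid indices and the batch
    estimate of the k-medoids objective.
    """
    rng = np.random.default_rng(seed)
    n, m = D.shape
    medoids = rng.choice(n, size=k, replace=False)
    best = D[medoids].min(axis=0).mean()
    for _ in range(max_sweeps):
        improved = False
        for pos in range(k):
            others = np.delete(medoids, pos)
            # Distance from each batch point to its nearest remaining medoid.
            rest = D[others].min(axis=0) if k > 1 else np.full(m, np.inf)
            # Estimated objective if each candidate filled the open slot.
            est = np.minimum(D, rest).mean(axis=1)
            est[others] = np.inf  # never pick a medoid that is already in use
            cand = int(est.argmin())
            if est[cand] < best - 1e-12:
                medoids[pos] = cand
                best = est[cand]
                improved = True
        if not improved:
            break
    return medoids, best

# Hypothetical usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
m = int(np.ceil(10 * np.log(len(X))))  # batch size O(log n); the 10 is arbitrary
batch = rng.choice(len(X), size=m, replace=False)
D = np.abs(X[:, None, :] - X[batch][None, :, :]).sum(axis=-1)  # (n, m) L1 dissimilarities
medoids, est = one_batch_swap_search(D, k=5)
```

Each sweep costs $O(knm)$ operations and the search never touches more than the $nm$ precomputed dissimilarities, which is where the $O(mn)$ figure in the abstract comes from.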
