OneBatchPAM: A Fast and Frugal K-Medoids Algorithm

Antoine de Mathelin, Nicolas Enrique Cecchi, François Deheeger, Mathilde Mougeot, Nicolas Vayatis

TL;DR

Problem: scalable k-medoids clustering on large datasets with non-metric dissimilarities. Approach: OneBatchPAM performs a PAM-like local search using a batch of size $m = O(\log n)$ to estimate swap gains, achieving $O((p+T)n\log n)$ time and $O(n\log n)$ memory while preserving the same swap decisions as FasterPAM with high probability. Contributions: theoretical guarantee on batch size, substantial runtime reductions over FasterPAM and BanditPAM++, and competitive objective values on real datasets; exploration of sampling variants. Findings: OneBatchPAM delivers similar clustering quality to state-of-the-art methods at dramatically lower running times, especially for large-scale datasets, with manageable memory overhead compared to $O(n^2)$ baselines. Significance: enables practical k-medoids clustering for big data with generic dissimilarities and opens avenues for further efficiency through coresets or adaptive batching.
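To give a feel for the memory claim, the short calculation below compares the number of stored dissimilarities for a full $n \times n$ matrix with an $n \times m$ batch matrix where $m$ grows like $\log n$. The constant `c` in $m = c \log n$ and the function name are illustrative assumptions; the summary only states $m = O(\log n)$ and does not give a constant.

```python
import math


def stored_dissimilarities(n: int, c: float = 100.0) -> tuple[int, int]:
    """Return (full_matrix_entries, one_batch_entries) for n points.

    c is a hypothetical constant in m = c * log(n); the source only
    states m = O(log n), so the exact batch size here is an assumption.
    """
    m = min(n, math.ceil(c * math.log(n)))  # batch size, capped at n
    return n * n, n * m


for n in (10_000, 1_000_000):
    full, batch = stored_dissimilarities(n)
    print(f"n={n}: full={full:.2e}, one-batch={batch:.2e}, ratio={full / batch:.0f}x")
```

Under this assumed constant, the gap between the two counts widens as $n$ grows, which is the qualitative behaviour the $O(n\log n)$ versus $O(n^2)$ comparison describes.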

Abstract

This paper proposes a novel k-medoids approximation algorithm that handles large-scale datasets with reasonable computational time and memory complexity. We develop a local-search algorithm that iteratively improves the medoid selection based on an estimate of the k-medoids objective. A single batch of size m << n provides this estimate, which reduces the required memory and the number of pairwise dissimilarity computations to O(mn), instead of the O(n^2) required by most k-medoids baselines. We obtain theoretical results showing that a batch of size m = O(log(n)) is sufficient to guarantee, with high probability, the same performance as the original local-search algorithm. Multiple experiments conducted on real datasets of various sizes and dimensions show that our algorithm achieves performance similar to state-of-the-art methods such as FasterPAM and BanditPAM++ with a drastically reduced running time.
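The idea in the abstract can be sketched roughly as follows: draw one batch, compute the n x m dissimilarities once, and run a swap-based local search whose objective is estimated on that batch only. This is a minimal illustration and not the authors' implementation. The function name `one_batch_kmedoids`, the Euclidean dissimilarity, the random initialization, and the greedy best-swap rule are all assumptions made here, and none of the FasterPAM speedups are included.

```python
import numpy as np


def one_batch_kmedoids(X, k, m, max_iter=100, seed=0):
    """Sketch: swap-based local search with a batch-estimated objective.

    Assumptions (not from the source): Euclidean dissimilarity, random
    initial medoids, greedy best-swap search, no FasterPAM-style caching.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    batch = rng.choice(n, size=m, replace=False)  # drawn once, reused throughout
    # n x m dissimilarities: candidate medoids (rows) vs. batch points (columns)
    D = np.linalg.norm(X[:, None, :] - X[batch][None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)

    # Batch estimate of the objective: mean distance to the nearest medoid
    best = D[medoids].min(axis=0).mean()
    for _ in range(max_iter):
        best_swap = None
        for i in range(k):
            others = np.delete(medoids, i)
            # Distance of each batch point to its nearest remaining medoid
            base = D[others].min(axis=0) if k > 1 else np.full(m, np.inf)
            # Estimated objective if each candidate replaced medoid i
            cand = np.minimum(D, base).mean(axis=1)
            cand[medoids] = np.inf  # current medoids are not valid candidates
            j = int(cand.argmin())
            if cand[j] < best - 1e-12:
                best, best_swap = cand[j], (i, j)
        if best_swap is None:  # no swap improves the estimate: local optimum
            break
        medoids[best_swap[0]] = best_swap[1]
    return medoids, best


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Three well-separated synthetic clusters of 50 points each
    X = np.vstack([rng.normal(c, 0.1, size=(50, 2)) for c in (0.0, 5.0, 10.0)])
    medoids, est = one_batch_kmedoids(X, k=3, m=30)
    print("medoid indices:", medoids, "estimated objective:", est)
```

Only the n x m matrix `D` is ever stored, so memory and the number of dissimilarity evaluations scale with mn rather than n^2. The search itself touches only `D`, never the full data.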

Paper Structure

This paper contains 16 sections, 3 theorems, 22 equations, 31 figures, 8 tables, 2 algorithms.

Key Result

Theorem 1

Let $\mathcal{X}_m$ be a subsample uniformly drawn from $\mathcal{X}_n$. Let $D = \max_{(x, x') \in \mathcal{X}_n} d(x, x')$ and $\Delta$ be the smallest difference between two objectives computed by FasterPAM. Then, for any $\delta \in ]0, 1]$, the OneBatchPAM algorithm returns the same set of medo… where $\Delta = \underset{t \in [\![0, T]\!]}{\min} \; \underset{\substack{x \in \mathcal{M}_t, \\ x' \ldots}}{\cdots} \; \ldots$

Figures (31)

  • Figure 1: Evolution of the running time and objective on the MNIST dataset. Left: evolution as a function of $n$ for $k = 10$. Right: evolution as a function of $k$ for $n = 10000$. The results for five competitors are reported: k-means++ (KM), FasterPAM (FP), FasterCLARA-5 (FC), BanditPAM++-2 (BP), OneBatchPAM (OBP)
  • Figure 2: RT and $\Delta$RO for Abalone
  • Figure 3: RT and $\Delta$RO for Bankruptcy
  • Figure 4: RT and $\Delta$RO for Mapping
  • Figure 5: RT and $\Delta$RO for Drybean
  • ...and 26 more figures

Theorems & Definitions (5)

  • Theorem 1
  • proof
  • Corollary 2
  • Theorem 1
  • proof