Improved seeding strategies for k-means and k-GMM

Guillaume Carrière, Frédéric Cazals

TL;DR

The paper advances seed initialization for both k-means and k-GMM by formalizing seeding as a three-component design problem: the sampling metric, the pool size, and the ranking metric. It introduces lookahead and multipass strategies that make seed selection more coherent with the final objective, and develops iterative reseeding variants that outperform prior methods such as the multi-swap strategy. Across extensive experiments, the proposed zig-zag and center-of-mass-based methods achieve constant-factor improvements in SSE for k-means and in log-likelihood for k-GMM, at modest overhead. The results reveal nuanced behavior of seeding, including the (lack of) correlation between the SSE upon seeding and the final SSE, variance reduction in iterative seeding methods, and sensitivity of the final SSE to the pool size for greedy methods, pointing to practical pathways toward standardization and theoretical study of seeding effectiveness.

Abstract

We revisit the randomized seeding techniques for k-means clustering and k-GMM (Gaussian mixture model fitting with Expectation-Maximization), formalizing their three key ingredients: the metric used for seed sampling, the number of candidate seeds, and the metric used for seed selection. This analysis yields novel families of initialization methods exploiting a lookahead principle--conditioning the seed selection on an enhanced coherence with the final metric used to assess the algorithm--and a multipass strategy to tame the effect of randomization. Experiments show a consistent constant-factor improvement over classical contenders in terms of the final metric (SSE for k-means, log-likelihood for k-GMM), at a modest overhead. In particular, for k-means, our methods improve on the recently designed multi-swap strategy, which was the first one to outperform the greedy k-means++ seeding. Our experimental analysis also sheds light on subtle properties of k-means often overlooked, including the (lack of) correlations between the SSE upon seeding and the final SSE, the variance reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods. Practically, our most effective seeding methods are strong candidates to become one of the--if not the--standard techniques. From a theoretical perspective, our formalization of seeding opens the door to a new line of analytical approaches.
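
To make the three ingredients concrete, here is a minimal sketch of the classical greedy k-means++ seeding that the abstract uses as a baseline, not of the paper's new lookahead or multipass methods. It assumes squared Euclidean distance to the nearest chosen seed (D² sampling) as the sampling metric, a fixed `pool_size` of candidates per step, and the SSE of the data with respect to the seeds as the ranking metric. The function name, its parameters, and the default pool size are illustrative assumptions.

```python
import numpy as np


def greedy_kmeanspp_seeding(X, k, pool_size=None, rng=None):
    """Greedy k-means++ seeding (illustrative sketch, not the paper's code).

    Sampling metric: squared distance to the nearest seed chosen so far.
    Pool size:       number of candidates drawn at each step.
    Ranking metric:  SSE of the data w.r.t. the seeds plus the candidate.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    if pool_size is None:
        # Assumed default: a pool growing logarithmically with k.
        pool_size = 2 + int(np.log(k))

    # The first seed is drawn uniformly at random.
    first = rng.integers(n)
    centers = [X[first]]
    # d2[i] = squared distance from X[i] to its nearest seed so far.
    d2 = np.sum((X - X[first]) ** 2, axis=1)

    for _ in range(1, k):
        total = d2.sum()
        # Sampling metric: probabilities proportional to d2 (uniform if degenerate).
        probs = d2 / total if total > 0 else np.full(n, 1.0 / n)
        candidates = rng.choice(n, size=pool_size, p=probs)

        # Squared distances from every point to every candidate: (pool_size, n).
        cand_d2 = np.sum((X[None, :, :] - X[candidates][:, None, :]) ** 2, axis=2)
        # Nearest-seed distances if each candidate were added.
        new_d2 = np.minimum(d2[None, :], cand_d2)

        # Ranking metric: keep the candidate yielding the smallest SSE.
        best = int(np.argmin(new_d2.sum(axis=1)))
        centers.append(X[candidates[best]])
        d2 = new_d2[best]

    return np.array(centers), float(d2.sum())
```

Setting `pool_size=1` recovers plain k-means++ seeding, since a single candidate leaves nothing to rank. Varying `pool_size` is one way to probe the sensitivity of the final SSE to the pool size that the abstract mentions for greedy methods.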

Paper Structure

This paper contains 46 sections, 14 equations, 22 figures, 2 tables, 2 algorithms.

Figures (22)

  • Figure 1: Gains yielded by our seeding methods. (A, k-means) Mean and median (over 18 datasets) of min-maxed SSE $\Phi_{K}$ ($m_3(\overline{\Phi}_{K})$, see Sec. \ref{sec:stats}) and CPU time ($m_3(\overline{t})$, see Sec. \ref{sec:stats}), the smaller the better--see also Sec. \ref{sec:kmeans-results}. (B, k-GMM) Mean and median of min-maxed log-likelihood (over 1800 datasets), the larger the better--see also Sec. \ref{sec:EM-results}. Seeding methods have negligible impact on CPU time for k-GMM (data not shown).
  • Figure 2: k-means: boxplot of the $\Phi_{K}^\text{D}$ and $\Phi_{K}^\text{S-COM}$ values along the seeding selection process for each $k\in \{1,\dots,K\}$. Statistics over 150 repeats on the spam dataset.
  • Figure 3: k-means: min-max normalized value $m_3(\overline{\Phi}_{K})$ -- Eq. \ref{eq:min-max-mean}, as a function of the seeding method. For reference: the seeding used in k-means++-G is SeedingEGD-EGD, and SeedingEGDx2-EGDx2 is the same with twice as many seeds to match the zig-zag strategy.
  • Figure 4: k-means: min-max normalized total CPU time $m_3(\overline{t})$ for each seeding method.
  • Figure 5: k-GMM: mean of the min-max normalized log-likelihood over datasets of each scenario. The larger the log-likelihood, the better. See text for details.
  • ...and 17 more figures

Theorems & Definitions (1)

  • Remark 1