Strong bounds for large-scale Minimum Sum-of-Squares Clustering

Anna Livia Croella, Veronica Piccialli, Antonio M. Sudoso

TL;DR

This work tackles the challenge of validating global optimality in Minimum Sum-of-Squares Clustering (MSSC) on large-scale data by introducing a divide-and-conquer framework guided by an auxiliary anticlustering problem. By reformulating MSSC with Huygens' theorem, it derives a strong lower bound $LB^* = \sum_{t} \mathrm{MSSC}(S_t, K)$ and decomposes the gap into per-subset suboptimality and between-subset dispersion, providing insight into bound tightness. The AVOC algorithm operationalizes this approach: it (i) constructs anticlustering partitions, (ii) evaluates lower-bound candidates using SOS-SDP, or k-means as an efficient proxy, and (iii) iteratively swaps elements to maximize the bound, yielding certified optimality gaps often below 3%. Extensive synthetic and real-world experiments show that AVOC delivers tight certificates within practical times (often under a few hours) for datasets of up to ~18k points, making it a meaningful tool for large-scale MSSC validation. Overall, the method fills a critical gap by enabling quantitative solution-quality guarantees for MSSC in settings where exact optimization is intractable, with potential extensions to constrained variants and further scalability improvements.
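
One standard way to see why the sum over subsets is a valid lower bound is sketched below. This is an illustrative argument, not necessarily the proof used in the paper. The symbols $C_j^*$ (clusters of an optimal solution of the full problem), $c_j^*$ (their centroids) and $c_{j,t}$ (centroid of $C_j^* \cap S_t$) are introduced here for illustration only, and each subset $S_t$ is assumed to contain at least $K$ points.

```latex
% C_j^*   : clusters of an optimal solution of MSSC(O,K), with centroids c_j^*
% c_{j,t} : centroid of C_j^* \cap S_t (terms with an empty intersection are dropped)
\begin{align*}
\mathrm{MSSC}(O,K)
  &= \sum_{j=1}^{K} \sum_{p_i \in C_j^*} \|p_i - c_j^*\|^2
   = \sum_{t=1}^{T} \sum_{j=1}^{K} \sum_{p_i \in C_j^* \cap S_t} \|p_i - c_j^*\|^2 \\
  &\ge \sum_{t=1}^{T} \sum_{j=1}^{K} \sum_{p_i \in C_j^* \cap S_t} \|p_i - c_{j,t}\|^2
   \ge \sum_{t=1}^{T} \mathrm{MSSC}(S_t,K) = LB^*.
\end{align*}
```

The first inequality holds because the centroid of a set minimizes the sum of squared distances to its points. The second holds because the restricted clusters $C_j^* \cap S_t$ form a feasible clustering of $S_t$ with at most $K$ clusters, and splitting clusters further never increases the objective.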

Abstract

Clustering is a fundamental technique in data analysis and machine learning, used to group similar data points together. Among various clustering methods, the Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used. MSSC aims to minimize the total squared Euclidean distance between data points and their corresponding cluster centroids. Due to the unsupervised nature of clustering, achieving global optimality is crucial, yet computationally challenging. The complexity of finding the global solution increases exponentially with the number of data points, making exact methods impractical for large-scale datasets. Even obtaining strong lower bounds on the optimal MSSC objective value is computationally prohibitive, making it difficult to assess the quality of heuristic solutions. We address this challenge by introducing a novel method to validate heuristic MSSC solutions through optimality gaps. Our approach employs a divide-and-conquer strategy, decomposing the problem into smaller instances that can be handled by an exact solver. The decomposition is guided by an auxiliary optimization problem, the "anticlustering problem", for which we design an efficient heuristic. Computational experiments demonstrate the effectiveness of the method for large-scale instances, achieving optimality gaps below 3% in most cases while maintaining reasonable computational times. These results highlight the practicality of our approach in assessing feasible clustering solutions for large datasets, bridging a critical gap in MSSC evaluation.
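
The divide-and-conquer idea can be illustrated on a toy instance. The following is a minimal sketch, not the authors' implementation: a brute-force enumeration stands in for the exact solver (feasible only for a handful of points), the partition is chosen by hand rather than by an anticlustering heuristic, and the function names `sse`, `exact_mssc` and `partition_lower_bound` are hypothetical. The gap is measured here against the exact optimum; on large instances a heuristic solution value would take its place.

```python
from itertools import product
import random


def sse(points):
    """Sum of squared Euclidean distances from the points to their centroid."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))


def exact_mssc(points, k):
    """Optimal MSSC value with exactly k non-empty clusters.

    Brute force over all k^n assignments: a stand-in for an exact solver,
    usable only on tiny inputs.
    """
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:  # skip assignments with an empty cluster
            continue
        clusters = [[p for p, l in zip(points, labels) if l == j] for j in range(k)]
        best = min(best, sum(sse(c) for c in clusters))
    return best


def partition_lower_bound(points, subsets, k):
    """Sum of exact MSSC values over disjoint index subsets covering all points."""
    return sum(exact_mssc([points[i] for i in s], k) for s in subsets)


random.seed(0)
# Toy data (assumption): two 2-D Gaussian blobs with 5 points each.
points = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(5)]
points += [(random.gauss(4, 0.5), random.gauss(4, 0.5)) for _ in range(5)]

K = 2
# Hand-picked partition into T = 2 subsets, each mixing points from both blobs.
subsets = [[0, 1, 5, 6, 7], [2, 3, 4, 8, 9]]

optimum = exact_mssc(points, K)
lower = partition_lower_bound(points, subsets, K)
gap = (optimum - lower) / optimum

print(f"optimum = {optimum:.4f}, lower bound = {lower:.4f}, gap = {gap:.2%}")
```

Because each subset is solved independently, the cost of the bound is driven by the subset size rather than by the size of the full dataset, which is what makes the decomposition attractive when the full problem is out of reach for an exact solver.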

Paper Structure

This paper contains 18 sections, 6 theorems, 30 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

[koontz1975branch, diehr1985evaluation] Assume the dataset $O=\{p_1,\ldots,p_N\}$ is divided into $T$ subsets $\mathcal{S} = \{S_1,\ldots,S_T\}$ such that $\cup_{t=1}^T S_t = O$, $S_t\cap S_{t^\prime}=\emptyset$ and $S_t \neq \emptyset$ for all $t,t^\prime\in [T]$, $t\neq t^\prime$, i.e., $\mathcal{S}$ i

Figures (6)

  • Figure 1: A synthetic dataset of $N=64$ points, $D=2$ features with $K=4$ natural well-separated clusters and its optimal clustering partition.
  • Figure 2: A partition $\mathcal{S}$ of the dataset of Figure 1 into $T=4$ subsets and the corresponding optimal clustering of each subset. Here, $\mathrm{MSSC}(O, K) = 575.674 = \sum_{t=1}^T \mathrm{MSSC}(S_t, K)$ and hence the lower bound is tight.
  • Figure 3: A clustering partition for a dataset of $N = 64$ points, with $K = 4$ clusters and $T=4$ anticlusters. The optimal value is $\mathrm{MSSC}(O,K) = 575.674$, while $\sum_{t=1}^T \mathrm{MSSC}(S_t, K) = 102.86$, so the gap is $82\%$.
  • Figure 4: Flowchart of the AVOC algorithm.
  • Figure 5: Visualization of the synthetic datasets generated with number of data points $N=10,000$, number of clusters $K = 3$, and noise level $\sigma \in \{0.50, 0.75, 1.00\}$.
  • ...and 1 more figure
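
The gap quoted for Figure 3 can be checked from the two values in its caption. The caption does not spell out the formula, so the check below assumes the gap is measured relative to the optimal value:

```latex
\[
\frac{\mathrm{MSSC}(O,K) - \sum_{t=1}^{T}\mathrm{MSSC}(S_t,K)}{\mathrm{MSSC}(O,K)}
  = \frac{575.674 - 102.86}{575.674}
  = \frac{472.814}{575.674}
  \approx 0.821 \approx 82\%.
\]
```

For the partition of Figure 2 the two values coincide, so the same ratio is 0. The two figures thus show how strongly the quality of the bound depends on the choice of partition.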

Theorems & Definitions (10)

  • Proposition 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Corollary 1