Strong bounds for large-scale Minimum Sum-of-Squares Clustering
Anna Livia Croella, Veronica Piccialli, Antonio M. Sudoso
TL;DR
This work tackles the challenge of validating global optimality in Minimum Sum-of-Squares Clustering (MSSC) on large-scale data by introducing a divide-and-conquer framework guided by an auxiliary anticlustering problem. By reformulating MSSC with Huygens' theorem, it derives a strong lower bound $LB^* = \sum_t \mathrm{MSSC}(S_t, K)$, where the subsets $S_t$ partition the data, and decomposes the gap into per-subset suboptimality and between-subset dispersion, providing insight into bound tightness. The AVOC algorithm operationalizes this approach: it (i) constructs anticlustering partitions, (ii) evaluates lower-bound candidates using SOS-SDP, with k-means as an efficient proxy, and (iii) iteratively swaps elements to maximize the bound, yielding certified optimality gaps often below 3%. Extensive synthetic and real-world experiments demonstrate that AVOC delivers tight certificates within practical times (often under a few hours) for datasets of up to ~18k points, making it a meaningful tool for large-scale MSSC validation. Overall, the method fills a critical gap by enabling quantitative solution-quality guarantees for MSSC in settings where exact optimization is intractable, with potential extensions to constrained variants and further scalability improvements.
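The lower bound above can be illustrated on a toy instance. The sketch below is not the AVOC implementation: it assumes a brute-force enumeration in place of the SOS-SDP exact solver, a hand-picked partition in place of the anticlustering heuristic, and a relative gap defined as (UB - LB) / UB. All function names (`sse`, `exact_mssc`, `lower_bound`, `optimality_gap`) are hypothetical. The bound holds because restricting any K-clustering of the full data to a subset S_t and recomputing centroids within S_t cannot increase the cost on S_t.

```python
import itertools
import numpy as np

def sse(points, labels, k):
    """Sum of squared Euclidean distances of points to their cluster centroid."""
    total = 0.0
    for c in range(k):
        members = points[labels == c]
        if len(members) > 0:
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def exact_mssc(points, k):
    """Brute-force MSSC optimum over all k^n assignments (tiny n only).

    Stand-in for an exact solver; an assumption of this sketch.
    """
    best = np.inf
    for labels in itertools.product(range(k), repeat=len(points)):
        best = min(best, sse(points, np.array(labels), k))
    return best

def lower_bound(points, subsets, k):
    """Sum of exact MSSC values over the subsets of a partition of the data."""
    return sum(exact_mssc(points[idx], k) for idx in subsets)

def optimality_gap(ub, lb):
    """Relative gap (UB - LB) / UB; this definition is assumed here."""
    return (ub - lb) / ub

# Toy data: two well-separated groups of five points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (5, 2)), rng.normal(6.0, 1.0, (5, 2))])
k = 2

# Hand-picked partition mixing both groups in each subset, mimicking the
# intent of anticlustering (each subset resembles the whole dataset).
subsets = [np.arange(0, 10, 2), np.arange(1, 10, 2)]

lb = lower_bound(X, subsets, k)
ub = exact_mssc(X, k)  # on real instances this would be a heuristic solution value
print(f"LB = {lb:.4f}, UB = {ub:.4f}, gap = {optimality_gap(ub, lb):.2%}")
```

On a toy instance the upper bound can be the true optimum; on large instances it would come from a heuristic such as k-means, and the gap then certifies how far that solution can be from optimal.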
Abstract
Clustering is a fundamental technique in data analysis and machine learning, used to group similar data points together. Among various clustering methods, the Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used. MSSC aims to minimize the total squared Euclidean distance between data points and their corresponding cluster centroids. Due to the unsupervised nature of clustering, achieving global optimality is crucial, yet computationally challenging. The complexity of finding the global solution increases exponentially with the number of data points, making exact methods impractical for large-scale datasets. Even obtaining strong lower bounds on the optimal MSSC objective value is computationally prohibitive, making it difficult to assess the quality of heuristic solutions. We address this challenge by introducing a novel method to validate heuristic MSSC solutions through optimality gaps. Our approach employs a divide-and-conquer strategy, decomposing the problem into smaller instances that can be handled by an exact solver. The decomposition is guided by an auxiliary optimization problem, the "anticlustering problem", for which we design an efficient heuristic. Computational experiments demonstrate the effectiveness of the method for large-scale instances, achieving optimality gaps below 3% in most cases while maintaining reasonable computational times. These results highlight the practicality of our approach in assessing feasible clustering solutions for large datasets, bridging a critical gap in MSSC evaluation.
