Accelerating data-driven algorithm selection for combinatorial partitioning problems
Vaggos Chatziafratis, Ishani Karmarkar, Yingxi Li, Ellen Vitercik
TL;DR
The paper tackles the scalability bottleneck in data-driven algorithm selection by introducing size generalization: the problem of predicting an algorithm’s performance on a large instance from a small, representative subsample of it. It develops rigorous guarantees for clustering and max-cut, covering center-based clustering methods (k-means++ and k-centers, the latter via a softened variant of Gonzalez's heuristic), single-linkage clustering, and the Goemans-Williamson (GW) and Greedy algorithms for max-cut. Under natural conditions, the required subsample size can be independent of the size of the full instance. The authors introduce Seeding and ApxSeeding for clustering and SoftmaxCenters to balance exploration and approximation, and they provide a general SDP- and martingale-based framework that relates subgraph objectives to full-graph performance. Experiments on synthetic and real data show substantial runtime speedups for algorithm selection while preserving predictive accuracy, and the work outlines a path toward extending size generalization to broader optimization problems.
Abstract
Data-driven algorithm selection is a powerful approach for choosing effective heuristics for computational problems. It operates by evaluating a set of candidate algorithms on a collection of representative training instances and selecting the one with the best empirical performance. However, running each algorithm on every training instance is computationally expensive, making scalability a central challenge. In practice, a common workaround is to evaluate algorithms on smaller proxy instances derived from the original inputs, but this practice has remained largely ad hoc and has lacked theoretical grounding. We provide the first theoretical foundations for this practice by formalizing the notion of size generalization: predicting an algorithm's performance on a large instance by evaluating it on a smaller, representative instance subsampled from the original instance. We provide size generalization guarantees for three widely used clustering algorithms (single-linkage, $k$-means++, and Gonzalez's $k$-centers heuristic) and two canonical max-cut algorithms (Goemans-Williamson and Greedy). We characterize the subsample size sufficient to ensure that performance on the subsample reflects performance on the full instance, and our experiments support these findings.
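
As a rough illustration of size generalization for max-cut, the sketch below runs a simple greedy heuristic on a full random graph and on a uniformly sampled induced subgraph, then compares the cut values after normalizing by the squared number of vertices. It is not the authors' implementation: the helper names (`random_graph`, `induced_subgraph`, `greedy_cut`, `cut_density`), the Erdős–Rényi toy instance, the uniform vertex sampling, and the choice of normalization are all assumptions made for this example.

```python
import random


def random_graph(n, p, rng):
    """Toy G(n, p) instance stored as {vertex: set of neighbors} (an assumption)."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def induced_subgraph(adj, nodes):
    """Keep only the sampled vertices and the edges between them."""
    keep = set(nodes)
    return {v: adj[v] & keep for v in keep}


def greedy_cut(adj):
    """Greedy max-cut: place each vertex opposite the majority of its
    already-placed neighbors. Returns (cut_edges, total_edges)."""
    side = {}
    cut = 0
    for v in sorted(adj):
        n0 = sum(1 for u in adj[v] if side.get(u) == 0)
        n1 = sum(1 for u in adj[v] if side.get(u) == 1)
        if n1 >= n0:
            side[v] = 0   # cuts the n1 edges to side-1 neighbors
            cut += n1
        else:
            side[v] = 1   # cuts the n0 edges to side-0 neighbors
            cut += n0
    total = sum(len(nbrs) for nbrs in adj.values()) // 2
    return cut, total


def cut_density(adj):
    """Greedy cut size divided by (number of vertices)^2."""
    cut, _ = greedy_cut(adj)
    n = len(adj)
    return cut / (n * n)


if __name__ == "__main__":
    rng = random.Random(0)
    full = random_graph(300, 0.3, rng)
    sub = induced_subgraph(full, rng.sample(range(300), 60))
    print("full-graph cut density:", cut_density(full))
    print("subsample cut density: ", cut_density(sub))
```

Running the greedy heuristic on the 60-vertex subsample touches far fewer edges than running it on the 300-vertex graph, which is the source of the speedup; how closely the two normalized values agree, and how large the subsample must be, is what the paper's guarantees address.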
