A Global Optimization Algorithm for K-Center Clustering of One Billion Samples

Jiayang Ren, Ningning You, Kaixun Hua, Chaojie Ji, Yankai Cao

TL;DR

This work addresses the large-scale $K$-center clustering problem by introducing a tailored reduced-space branch-and-bound algorithm that guarantees finite-step convergence by branching only on the region of cluster centers. It features a two-stage decomposable lower bound with a closed-form solution, and accelerates pruning via bounds tightening, sample reduction, and parallelization, all implemented in Julia. Empirical results show the method solves datasets from $10^7$ to $10^9$ samples within 4 hours and achieves an average $25.8\%$ improvement in the objective over state-of-the-art heuristics. The approach enables globally optimal clustering at unprecedented scales, with open-source code and potential extensions to constrained variants.

Abstract

This paper presents a practical global optimization algorithm for the K-center clustering problem, which aims to select K samples as the cluster centers to minimize the maximum within-cluster distance. The algorithm is based on a reduced-space branch-and-bound scheme and guarantees convergence to the global optimum in a finite number of steps by branching only on the regions of centers. To improve efficiency, we design a two-stage decomposable lower bound whose solution can be derived in closed form. We also propose several acceleration techniques to narrow down the region of centers, including bounds tightening, sample reduction, and parallelization. Extensive studies on synthetic and real-world datasets demonstrate that our algorithm can solve K-center problems to global optimality within 4 hours for ten million samples in serial mode and one billion samples in parallel mode. Moreover, compared with state-of-the-art heuristic methods, the global optimum obtained by our algorithm reduces the objective function by 25.8% on average across all the synthetic and real-world datasets.
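To make the objective concrete, the following is a minimal Python sketch (not the authors' Julia implementation) that evaluates the K-center objective for a given choice of centers and compares an exhaustive search with a simple farthest-first heuristic on a toy dataset. The function names are hypothetical, the squared Euclidean distance is an assumption suggested by the squared norms in the figure captions, and farthest-first traversal is only a stand-in for the heuristics the paper compares against.

```python
import numpy as np
from itertools import combinations


def k_center_objective(X, center_idx):
    """Maximum over samples of the squared distance to the nearest chosen center."""
    C = X[list(center_idx)]
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # shape (S, K)
    return float(d2.min(axis=1).max())


def brute_force_k_center(X, K):
    """Exhaustive search over all K-subsets of samples (toy sizes only)."""
    best = min(combinations(range(len(X)), K),
               key=lambda c: k_center_objective(X, c))
    return best, k_center_objective(X, best)


def farthest_first(X, K, start=0):
    """Greedy heuristic: repeatedly add the sample farthest from the current centers."""
    centers = [start]
    d2 = ((X - X[start]) ** 2).sum(axis=1)
    for _ in range(K - 1):
        nxt = int(d2.argmax())
        centers.append(nxt)
        d2 = np.minimum(d2, ((X - X[nxt]) ** 2).sum(axis=1))
    return tuple(centers), float(d2.max())


X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
print(brute_force_k_center(X, 2))  # ((1, 4), 1.0)
print(farthest_first(X, 2))        # ((0, 5), 4.0)
```

On this toy input the heuristic picks the two extreme points and ends with an objective of 4, while the exhaustive optimum is 1. Exhaustive search is only feasible for tiny inputs, which is the gap a global method with pruning is meant to close.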

Paper Structure (30 sections, 6 theorems, 13 equations, 6 figures, 5 tables, 5 algorithms)

This paper contains 30 sections, 6 theorems, 13 equations, 6 figures, 5 tables, 5 algorithms.

Key Result

Theorem 1

The branch-and-bound scheme (Algorithm alg:bb_sche) converges to the global optimal solution in a finite number of steps $L$, with $\beta_L=z=\alpha_L$, by branching only on the region of centers.
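In Theorem 1, $\beta$ and $\alpha$ play the roles of a lower and an upper bound on the optimal value $z$. As an illustration of how a closed-form lower bound over a region of centers can arise, the sketch below assumes each cluster's center is confined to an axis-aligned box $M^k$ and relaxes the requirement that centers be samples. Under that assumption, the smallest squared distance from a sample to a box is obtained by clamping the sample to the box, and the bound is the maximum over samples of the minimum over clusters. This is one valid relaxation written for illustration. The paper's exact bound may differ, and the function names are hypothetical.

```python
import numpy as np


def box_sq_dist(X, lower, upper):
    """Closed-form minimum squared distance from each row of X to the box [lower, upper]."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    nearest = np.clip(X, lower, upper)  # closest point of the box to each sample
    return ((X - nearest) ** 2).sum(axis=1)


def lower_bound(X, boxes):
    """max over samples of min over clusters of the sample-to-box squared distance."""
    per_cluster = np.stack([box_sq_dist(X, lo, hi) for lo, hi in boxes], axis=1)
    return float(per_cluster.min(axis=1).max())


X = np.array([[0.0, 0.0], [4.0, 0.0], [10.0, 0.0]])
boxes = [([0.0, -1.0], [1.0, 1.0]), ([9.0, -1.0], [11.0, 1.0])]
print(lower_bound(X, boxes))  # 9.0
```

Here the middle sample is at least 3 units from either box, so no choice of centers inside these boxes can achieve an objective below 9. A node whose lower bound exceeds the best known upper bound can be discarded without further branching.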

Figures (6)

  • Figure 1: Initial seeds with 3 clusters. In this example, $||x_1-x_2||^2_2>4\alpha$, $||x_2-x_3||^2_2>4\alpha$ and $||x_3-x_1||^2_2>4\alpha$. Therefore, we can arbitrarily assign $x_1, x_2, x_3$ to 3 distinct clusters.
  • Figure 2: Center-based assignment with 3 clusters. In this example, $\beta_{s}^2(M^2)>\alpha$ ($b_s^2=0$) and $\beta_{s}^3(M^3)>\alpha$ ($b_s^3=0$). Therefore, we assign $x_s$ to the first cluster ($b_s^1=1$).
  • Figure 3: Sample-based assignment with 3 clusters. Assume we already know that $x_1, x_2, x_3$ belong to cluster $1,2$ and $3$, respectively. $x_s$ is the sample to be determined. In this example, $||x_s-x_1||^2_2>4\alpha$ ($b_s^1=0$) and $||x_s-x_2||^2_2>4\alpha$ ($b_s^2=0$). Therefore, $x_s$ is assigned to cluster 3 ($b_s^3=1$).
  • Figure 4: Ball-based bounds tightening in two-dimensional space. In this example, suppose it is determined that two points $x_i$ and $x_j$ belong to the $k$th cluster. We first compute the index set of samples within all balls and the original box, $\mathcal{S}^k_{+}(M):= \{s\in \mathcal{S} \mid x_{s}\in X\cap M^k \cap B_{\alpha}(x_i)\cap B_{\alpha}(x_j)\}$. We then generate the smallest box containing the samples in $\mathcal{S}^k_{+}(M)$. The red rectangle shows the tightened bounds.
  • Figure 5: Box-based bounds tightening in two-dimensional space. In this example, we first generate two boxes, $R_{\alpha}(x_i):=\{x \mid x_i-\sqrt{\alpha}\leq x \leq x_i+\sqrt{\alpha}\}$ and $R_{\alpha}(x_j):=\{x \mid x_j-\sqrt{\alpha}\leq x \leq x_j+\sqrt{\alpha}\}$. We then create the tightened bounds $\hat{M}^k = R_{\alpha}(x_i) \cap R_{\alpha}(x_j) \cap M^k$. The red rectangle shows the tightened bounds.
  • ...and 1 more figure
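The assignment and tightening rules described in the captions of Figures 3 to 5 can be sketched in Python as follows. This is an illustrative reading of the captions, not the authors' code. It assumes $\alpha$ is an upper bound on the squared-distance objective, so two samples sharing a center can be at most $2\sqrt{\alpha}$ apart, which gives the $4\alpha$ threshold on the squared distance by the triangle inequality. It also assumes $B_{\alpha}(x)$ is the ball of squared radius $\alpha$ around $x$. All function names are hypothetical.

```python
import numpy as np


def sample_based_assignment(x_s, representatives, alpha):
    """Return the only cluster index x_s can join, or None if it is undetermined.

    representatives[k] is a sample already known to belong to cluster k.
    """
    x_s = np.asarray(x_s, dtype=float)
    feasible = [k for k, x_k in enumerate(representatives)
                if ((x_s - np.asarray(x_k, dtype=float)) ** 2).sum() <= 4 * alpha]
    return feasible[0] if len(feasible) == 1 else None


def box_tightening(lower, upper, assigned, alpha):
    """Intersect the box M^k with the boxes R_alpha(x) of samples assigned to cluster k."""
    lo = np.asarray(lower, dtype=float).copy()
    hi = np.asarray(upper, dtype=float).copy()
    r = np.sqrt(alpha)
    for x in assigned:
        x = np.asarray(x, dtype=float)
        lo = np.maximum(lo, x - r)
        hi = np.minimum(hi, x + r)
    return lo, hi


def ball_tightening(X, lower, upper, assigned, alpha):
    """Smallest box around the samples lying in M^k and in every ball B_alpha(x)."""
    inside = np.all((X >= np.asarray(lower)) & (X <= np.asarray(upper)), axis=1)
    for x in assigned:
        inside &= ((X - np.asarray(x, dtype=float)) ** 2).sum(axis=1) <= alpha
    if not inside.any():
        return None  # no candidate center remains in this region
    pts = X[inside]
    return pts.min(axis=0), pts.max(axis=0)


reps = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
print(sample_based_assignment([0.5, 9.5], reps, alpha=1.0))  # 2

assigned = [[0.0, 0.0], [1.0, 0.0]]
print(box_tightening([-5, -5], [5, 5], assigned, alpha=1.0))
X = np.array([[0.5, 0.0], [0.5, 0.5], [3.0, 3.0], [-0.9, 0.0]])
print(ball_tightening(X, [-5, -5], [5, 5], assigned, alpha=1.0))
```

In this sketch the ball-based variant yields a box no larger than the box-based one, because it keeps only actual samples that lie in every ball, at the cost of a pass over the data. The box-based variant needs only the assigned samples.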

Theorems & Definitions (7)

  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Definition 1
  • Lemma 3
  • Theorem 2
  • Lemma 4