Deterministic $k$-Median Clustering in Near-Optimal Time
Martín Costa, Ermiya Farokhnejad
TL;DR
This work addresses deterministic algorithms for metric $k$-median in near-optimal time. It combines a hierarchical partitioning framework with a restricted $k$-median approach and a restricted reverse greedy algorithm to achieve a $O(\log(n/k))$-approximation in $\tilde{O}(nk)$ time, matching the randomized results up to polylog factors but revealing a deterministic lower bound that grows with $\log n$ and depends on $k$. A key contribution is the Restricted-$k$-median routine and its efficient deterministic implementation, enabling a bottom-up assembly of a global solution from local, restricted subproblems. The results establish a separation between deterministic and randomized settings for $k$-median and extend to the $k$-means case, with implications for near-optimal clustering under deterministic constraints.
Abstract
The metric $k$-median problem is a textbook clustering problem. As input, we are given a metric space $V$ of size $n$ and an integer $k$, and our task is to find a subset $S \subseteq V$ of at most $k$ `centers' that minimizes the total distance from each point in $V$ to its nearest center in $S$. Mettu and Plaxton [UAI'02] gave a randomized algorithm for $k$-median that computes a $O(1)$-approximation in $\tilde O(nk)$ time. They also showed that any algorithm for this problem with a bounded approximation ratio must have a running time of $Ω(nk)$. Thus, the running time of their algorithm is optimal up to polylogarithmic factors. For deterministic $k$-median, Guha et al.~[FOCS'00] gave an algorithm that computes a $\text{poly}(\log (n/k))$-approximation in $\tilde O(nk)$ time, where the degree of the polynomial in the approximation is unspecified. To the best of our knowledge, this remains the state-of-the-art approximation of any deterministic $k$-median algorithm with this running time. This leads us to the following natural question: What is the best approximation of a deterministic $k$-median algorithm with near-optimal running time? We make progress in answering this question by giving a deterministic algorithm that computes a $O(\log(n/k))$-approximation in $\tilde O(nk)$ time. We also provide a lower bound showing that any deterministic algorithm with this running time must have an approximation ratio of $Ω(\log n/(\log k + \log \log n))$, establishing a gap between the randomized and deterministic settings for $k$-median.
