A Tight VC-Dimension Analysis of Clustering Coresets with Applications

Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, David Saulpic, Chris Schwiegelshohn

TL;DR

This work gives a sharp VC-dimension based analysis for constructing coresets in k-clustering, yielding near-optimal coreset sizes across several metrics by coupling layered group sampling with clustering nets. Layered group sampling reduces the number of groups, while clustering nets discretize the cost space of candidate solutions; together they enable uniform concentration bounds via Gaussian-process reductions. The results include improved coreset bounds for shortest-path metrics in planar graphs and Fréchet/Hausdorff-type metrics for polygonal curves, among others, and they clarify the limits of VC-dimension based bounds in Euclidean spaces. Together, these advances provide a principled, broadly applicable framework for small, provably accurate coresets in diverse metric settings, with direct implications for efficient clustering under practical distance measures.

Abstract

We consider coresets for $k$-clustering problems, where the goal is to assign points to centers minimizing powers of distances. A popular example is the $k$-median objective $\sum_{p\in P}\min_{c\in C}\mathrm{dist}(p,c)$. Given a point set $P$, a coreset $\Omega$ is a small weighted subset that approximates the cost of $P$ for all candidate solutions $C$ up to a $(1\pm\varepsilon)$ multiplicative factor. In this paper, we give a sharp VC-dimension based analysis for coreset construction. As a consequence, we obtain improved $k$-median coreset bounds for the following metrics:

  • Coresets of size $\tilde{O}\left(k\varepsilon^{-2}\right)$ for shortest path metrics in planar graphs, improving over the bounds $\tilde{O}\left(k\varepsilon^{-6}\right)$ by [Cohen-Addad, Saulpic, Schwiegelshohn, STOC'21] and $\tilde{O}\left(k^2\varepsilon^{-4}\right)$ by [Braverman, Jiang, Krauthgamer, Wu, SODA'21].
  • Coresets of size $\tilde{O}\left(kd\ell\varepsilon^{-2}\log m\right)$ for clustering $d$-dimensional polygonal curves of length at most $m$ with curves of length at most $\ell$ with respect to Fréchet metrics, improving over the bounds $\tilde{O}\left(k^3d\ell\varepsilon^{-3}\log m\right)$ by [Braverman, Cohen-Addad, Jiang, Krauthgamer, Schwiegelshohn, Toftrup, and Wu, FOCS'22] and $\tilde{O}\left(k^2d\ell\varepsilon^{-2}\log m \log |P|\right)$ by [Conradi, Kolbe, Psarros, Rohde, SoCG'24].
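To make the coreset definition in the abstract concrete, the following is a minimal Python sketch of the weighted $k$-median cost and of the $(1\pm\varepsilon)$ check. It is not the paper's construction: the function names `kmedian_cost` and `satisfies_coreset_guarantee` are hypothetical, the metric is assumed to be Euclidean for simplicity, and the check runs only over a finite list of candidate solutions, whereas a true coreset must satisfy the guarantee for every candidate solution $C$.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def kmedian_cost(points: Sequence[Point],
                 centers: Sequence[Point],
                 weights: Optional[Sequence[float]] = None) -> float:
    """Weighted k-median cost: sum over p of w(p) * min over c of dist(p, c).

    Assumption: Euclidean distance stands in for the generic metric dist.
    """
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def satisfies_coreset_guarantee(points: Sequence[Point],
                                coreset: Sequence[Point],
                                weights: Sequence[float],
                                candidate_solutions: Sequence[Sequence[Point]],
                                eps: float) -> bool:
    """Check |cost(coreset, C) - cost(P, C)| <= eps * cost(P, C).

    Only the supplied candidate solutions are tested; this illustrates the
    definition and does not certify a coreset.
    """
    for centers in candidate_solutions:
        full = kmedian_cost(points, centers)
        approx = kmedian_cost(coreset, centers, weights)
        if abs(approx - full) > eps * full:
            return False
    return True


# Toy instance: duplicated points collapse into an exact weighted coreset.
P: List[Point] = [(0.0, 0.0)] * 3 + [(10.0, 0.0)] * 2
omega: List[Point] = [(0.0, 0.0), (10.0, 0.0)]
omega_weights = [3.0, 2.0]
solutions = [[(1.0, 0.0)], [(0.0, 0.0), (10.0, 0.0)], [(5.0, 5.0)]]

print(satisfies_coreset_guarantee(P, omega, omega_weights, solutions, eps=0.1))
```

In this toy instance the weighted subset reproduces the cost exactly because it merely merges duplicate points. The interest of the results above lies in how small $\Omega$ can be when no such trivial compression exists.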

Paper Structure (33 sections, 24 theorems, 97 equations, 2 figures, 1 table, 1 algorithm)

Key Result

Theorem 1.1

Let $P$ be a point set in some metric space $(X,\textup{dist})$ and let $d_{\textmd{\textup{VC}}}$ be the VC dimension of metric balls. Then there exists an $\varepsilon$-coreset for $k$-median of size

Figures (2)

  • Figure 1: We draw one layered group $G$, composed of points in various $A_g$'s, i.e., for each $i$, points may belong to different clusters, all with roughly the same average cost. Points in $A_g$ are only allowed to be served by centroids in the approximate centroid sets of the $A_j$'s for $j \leq g$ (color-coded squares in the figure).
  • Figure 2: Black dashed circles depict the cluster core along with its average cost. Black solid circles instead represent the cluster's innermost of the outer rings falling in the $\ell$-th layer, to which group $G$ belongs. As we may observe, even though the average costs may differ (see $\Delta_{C_2}$ vs. $\Delta_{C_3}$), the clusters still belong to the same cluster union $O_2$ (blue dotted oval), because the $\ell$-th layering induces an innermost of the outer rings that, in the case of $C_2 \cap G$, is much further out than the innermost of the outer rings for $C_3 \cap G$. Finally, points in $O_2$ are only allowed to be served by centroids in its own approximate centroid set or in $O_1$'s (red dotted oval) approximate centroid set.

Theorems & Definitions (42)

  • Theorem 1.1
  • Theorem 1.2
  • Proposition 2.1
  • proof
  • Definition 2.2: Clusters of Type $i$
  • Lemma 2.2
  • Lemma 2.2
  • Lemma 2.2
  • Lemma 2.2
  • Lemma 2.2
  • ...and 32 more