Linear time small coresets for k-mean clustering of segments with applications

David Denisov, Shlomi Dolev, Dan Feldman, Michael Segal

TL;DR

This work addresses $k$-means clustering of a set of segments in $\mathbb{R}^d$, where the loss of a segment is the integral of the distances from its points to the assigned center. The authors introduce a per-segment coreset construction (Seg-Coreset) that represents each segment by a small finite point set, then combine these point sets with an outlier-resistant weighted-point coreset framework to obtain a global coreset for the entire segment set. The theoretical guarantees give a coreset of size $O(\log^2 n)$ for fixed $k$ and $\varepsilon$, with an overall construction time of $O(nd)$, which enables efficient streaming and distributed computation. Experiments on synthetic data, motion vectors for video tracking, and OpenStreetMap road segments show substantial speedups with minimal loss in clustering accuracy, including a real-time video tracking example that runs at over 1,400 frames per second on standard hardware. Overall, the work extends point-based coreset techniques to segments, enabling scalable segment clustering with provable guarantees and applications in video analytics and geospatial data analysis.

Abstract

We study the $k$-means problem for a set $\mathcal{S} \subseteq \mathbb{R}^d$ of $n$ segments, aiming to find $k$ centers $X \subseteq \mathbb{R}^d$ that minimize $D(\mathcal{S},X) := \sum_{S \in \mathcal{S}} \min_{x \in X} D(S,x)$, where $D(S,x) := \int_{p \in S} |p - x| dp$ measures the total distance from each point along a segment to a center. Variants of this problem include handling outliers, employing alternative distance functions such as M-estimators, weighting distances to achieve balanced clustering, or enforcing unique cluster assignments. For any $\varepsilon > 0$, an $\varepsilon$-coreset is a weighted subset $C \subseteq \mathbb{R}^d$ that approximates $D(\mathcal{S},X)$ within a factor of $1 \pm \varepsilon$ for any set of $k$ centers, enabling efficient streaming, distributed, or parallel computation. We propose the first coreset construction that provably handles arbitrary input segments. For constant $k$ and $\varepsilon$, it produces a coreset of size $O(\log^2 n)$ computable in $O(nd)$ time. Experiments, including a real-time video tracking application, demonstrate substantial speedups with minimal loss in clustering accuracy, confirming both the practical efficiency and theoretical guarantees of our method.
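
To make the loss concrete, the following is a minimal Python sketch of $D(S,x)$ and $D(\mathcal{S},X)$ as defined above. It is not the authors' implementation. The integral is approximated with the midpoint rule (a choice made here, not taken from the paper), and all function names (`segment_cost`, `clustering_cost`, `naive_point_set`, `weighted_cost`) are hypothetical. The helper `naive_point_set` only illustrates the idea of replacing a segment by a few weighted points using equally spaced samples; it is an assumption for illustration and makes no claim to match Seg-Coreset or to carry its $1 \pm \varepsilon$ guarantee.

```python
import math

def segment_cost(a, b, x, steps=2000):
    """Approximate D(S, x), the integral of |p - x| over the segment S = [a, b].

    Uses the midpoint rule with `steps` equal pieces (an illustrative choice).
    """
    length = math.dist(a, b)
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps  # midpoint of the i-th piece, in [0, 1]
        p = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
        total += math.dist(p, x)
    return total * length / steps

def clustering_cost(segments, centers, steps=2000):
    """D(S, X): each whole segment is charged to the one center minimizing D(S, x)."""
    return sum(
        min(segment_cost(a, b, x, steps) for x in centers)
        for a, b in segments
    )

def naive_point_set(a, b, m):
    """Hypothetical stand-in for a per-segment point set (NOT the paper's Seg-Coreset).

    Returns m equally spaced points on [a, b], each with weight length / m.
    """
    length = math.dist(a, b)
    points = []
    for i in range(m):
        t = (i + 0.5) / m
        p = tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))
        points.append((p, length / m))
    return points

def weighted_cost(points, x):
    """Weighted sum of distances from a weighted point set to a single center x."""
    return sum(w * math.dist(p, x) for p, w in points)

# Two short segments in the plane and two candidate centers.
segments = [((0.0, 0.0), (2.0, 0.0)), ((10.0, 0.0), (12.0, 0.0))]
centers = [(1.0, 0.0), (11.0, 0.0)]
print("D(S, X) ~", clustering_cost(segments, centers))

# Compare the 4-point stand-in with the finer numerical integral for one center.
a, b = segments[0]
x = (0.0, 3.0)
print("integral ~", segment_cost(a, b, x))
print("4 points ~", weighted_cost(naive_point_set(a, b, 4), x))
```

Note that the minimum in `clustering_cost` is taken per segment, not per point: following the definition, a segment is assigned as a whole to one center. An actual $\varepsilon$-coreset must approximate this cost for every set of $k$ centers simultaneously, which the equally spaced samples above are not claimed to do.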
