On Simplifying Large-Scale Spatial Vectors: Fast, Memory-Efficient, and Cost-Predictable k-means
Yushuai Ji, Zepeng Liu, Sheng Wang, Yuan Sun, Zhiyong Peng
TL;DR
Dask-means is a fast, memory-efficient k-means for large-scale spatial vectors. It builds a memory-tunable Ball-tree index over centroids and spatial vectors and applies batched $\mathtt{k}$NN-based pruning, which eliminates the need for per-vector bounds. A lightweight cost estimator jointly predicts memory usage and runtime by modeling index size and iteration dynamics: a linear predictor estimates the number of iterations, a non-linear regressor estimates per-iteration time, and a Gaussian Process adjusts the runtime estimate using posterior information. Empirical results show speedups of over 168x relative to Lloyd's algorithm and memory usage below 30 MB on datasets at scales such as $10^6$, with accurate cost predictions (memory error below 3%, runtime MSE up to 33.3% lower than SOTA methods). The approach enables efficient point-cloud and trajectory simplification on edge devices, which benefits real-time analytics and learning on resource-constrained hardware.
Abstract
The k-means algorithm can simplify large-scale spatial vectors, such as 2D geo-locations and 3D point clouds, to support fast analytics and learning. However, existing k-means algorithms for large-scale datasets are designed to achieve high performance at the cost of significant computational resources, such as memory and CPU time. These algorithms, though effective, are not well suited to resource-constrained devices. In this paper, we propose a fast, memory-efficient, and cost-predictable k-means called Dask-means. We first accelerate k-means with a memory-efficient accelerator, which uses an optimized nearest neighbor search over a memory-tunable index to assign spatial vectors to clusters in batches. We then design a lightweight cost estimator that predicts the memory cost and runtime of a k-means task, so that the task can request appropriate memory from the device or adjust the accelerator's required space to meet memory constraints, and can secure sufficient CPU time to run. Experiments show that when simplifying datasets at scales such as $10^6$, Dask-means uses less than $30$ MB of memory and achieves a speedup of over $168$ times compared to the widely used Lloyd's algorithm. We also validate Dask-means on mobile devices, where it demonstrates significant speedup and low memory cost compared to other state-of-the-art (SOTA) k-means algorithms. Our cost estimator predicts memory cost to within $3\%$ of the actual value and predicts runtime with an MSE up to $33.3\%$ lower than SOTA methods.
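
For orientation, the Lloyd's algorithm baseline mentioned above can be sketched in a few lines. This is a minimal illustration of the standard algorithm only, not of Dask-means: it has no index, no batching, and no pruning, so every iteration compares each vector against every centroid. The function names, the `max_iters` and `seed` parameters, and the random initialization are assumptions made for this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, max_iters=100, seed=0):
    """Plain Lloyd's k-means on a list of tuples (hypothetical helper).

    Returns (centroids, assignment), where assignment[i] is the index of
    the centroid that points[i] was assigned to.
    """
    rng = random.Random(seed)
    # Assumption: initialize centroids by sampling k distinct input points.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignment = [-1] * len(points)
    dim = len(points[0])

    for _ in range(max_iters):
        # Assignment step: brute-force nearest centroid for every vector.
        # This O(n * k) scan is the cost that accelerated variants reduce.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no vector changed cluster, so the centroids are stable
        assignment = new_assignment

        # Update step: move each centroid to the mean of its cluster.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p, j in zip(points, assignment):
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        centroids = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centroids[j]
            for j in range(k)  # an empty cluster keeps its old centroid
        ]

    return centroids, assignment
```

Calling `lloyd_kmeans(points, k)` on a list of 2D or 3D tuples returns the `k` centroids that serve as the simplified representation of the input, along with the cluster index of each vector.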
