On Simplifying Large-Scale Spatial Vectors: Fast, Memory-Efficient, and Cost-Predictable k-means

Yushuai Ji, Zepeng Liu, Sheng Wang, Yuan Sun, Zhiyong Peng

TL;DR

Dask-means is a fast, memory-efficient k-means for large-scale spatial vectors. It indexes both centroids and spatial vectors in a memory-tunable Ball-tree and assigns vectors to clusters in batches through $\mathtt{k}$NN-based pruning, eliminating the need for per-vector bounds. A lightweight cost estimator jointly predicts memory usage and runtime by modeling index size and iteration dynamics: a linear predictor estimates the number of iterations, a non-linear regressor estimates per-iteration time, and Gaussian Process-based runtime adjustments incorporate posterior information. Empirical results show substantial speedups (over 168x compared to Lloyd's algorithm) and memory usage below 30 MB on datasets at the $10^6$ scale, with accurate memory and runtime predictions (memory error < 3%, runtime MSE up to 33.3% lower than SOTA). The approach enables efficient point-cloud and trajectory simplification on edge devices, offering practical benefits for real-time analytics and learning on resource-constrained hardware.

Abstract

The k-means algorithm can simplify large-scale spatial vectors, such as 2D geo-locations and 3D point clouds, to support fast analytics and learning. However, existing k-means algorithms achieve high performance on large-scale datasets at the cost of significant computational resources, such as memory and CPU time. These algorithms, though effective, are not well-suited for resource-constrained devices. In this paper, we propose a fast, memory-efficient, and cost-predictable k-means called Dask-means. We first accelerate k-means by designing a memory-efficient accelerator, which uses an optimized nearest neighbor search over a memory-tunable index to assign spatial vectors to clusters in batches. We then design a lightweight cost estimator to predict the memory cost and runtime of the k-means task, allowing it to request appropriate memory from devices or adjust the accelerator's required space to meet memory constraints, and to ensure sufficient CPU time for running k-means. Experiments show that when simplifying datasets at scales such as $10^6$, Dask-means uses less than $30$ MB of memory and achieves a speedup of over $168$ times compared to the widely used Lloyd's algorithm. We also validate Dask-means on mobile devices, where it demonstrates significant speedup and low memory cost compared to other state-of-the-art (SOTA) k-means algorithms. Our cost estimator estimates the memory cost with a difference of less than $3\%$ from the actual cost and predicts runtime with an MSE up to $33.3\%$ lower than SOTA methods.
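
To illustrate what assigning spatial vectors to clusters "in batches" can look like, the following is a minimal sketch of the generic ball bound, not the Dask-means accelerator itself. It assumes a ball described by a pivot and a radius, at least two centroids, and Euclidean distance; the function name `assign_ball` and its return convention are hypothetical. If $d_1$ and $d_2$ are the distances from the pivot to its nearest and second-nearest centroids and $d_2 - d_1 > 2r$, the triangle inequality guarantees that every vector in the ball shares the same nearest centroid, so no per-vector distance is needed.

```python
import math


def assign_ball(pivot, radius, points, centroids):
    """Assign all points of one ball; return (labels, pruned).

    Hypothetical helper for illustration. `pruned` is True when the whole
    ball was assigned without computing any per-point distance.
    Requires at least two centroids.
    """
    # Rank centroids by distance from the ball pivot.
    ranked = sorted((math.dist(pivot, c), j) for j, c in enumerate(centroids))
    (d1, nearest), (d2, _) = ranked[0], ranked[1]

    # For any x in the ball: dist(x, c_nearest) <= d1 + radius, and
    # dist(x, c) >= d2 - radius for every other centroid c.
    if d2 - d1 > 2 * radius:
        return [nearest] * len(points), True

    # The ball may straddle a cluster boundary: fall back to per-point search.
    labels = [
        min(range(len(centroids)), key=lambda j: math.dist(x, centroids[j]))
        for x in points
    ]
    return labels, False


centroids = [(0.0, 0.0), (10.0, 0.0)]
# A tight ball close to the first centroid is assigned in one step.
tight = assign_ball((1.0, 0.0), 1.0, [(0.5, 0.0), (1.5, 0.5)], centroids)
# A wide ball midway between the centroids needs per-point checks.
wide = assign_ball((5.0, 0.0), 3.0, [(3.0, 0.0), (7.0, 0.0)], centroids)
```

In a tree index, the fallback branch would typically descend into the child nodes and retry the test on smaller balls rather than scanning the vectors directly.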

Paper Structure

This paper contains 25 sections, 21 equations, 15 figures, 9 tables, 1 algorithm.

Figures (15)

  • Figure 1: The simplified point clouds by random sampling (left) and our $k$-means clustering algorithm (right).
  • Figure 2: Pruning using ball node and inter bound.
  • Figure 3: Framework of Dask-means.
  • Figure 4: Pruning with a single indexing tree, where a spatial vector node $N$ contains two child nodes; pruning with centroid index node $N_C$, where $\mathbf{c}_{n_1}$ and $\mathbf{c}_{n_2}$ represent the two nearest centroids to $N$'s pivot ($\mathbf{p}^*$), with the corresponding distances $d_1$ and $d_2$ (where $d_2 > d_1$); $N.\mathbf{p}^*$ refers to the pivot of $N$ and $N_C.\mathbf{p}$ refers to the pivot of $N_C$.
  • Figure 5: Overview of our lightweight cost estimator, where $y_i$ denotes the actual runtime for the $i$-th iteration ($i = 1, 2, \dots, q$), and $\hat{y}_j$ represents the predicted runtime for the $j$-th iteration ($j = 1, 2, \dots, q$).
  • ...and 10 more figures