Popcorn: Accelerating Kernel K-means on GPUs through Sparse Linear Algebra
Julian Bellavita, Thomas Pasquali, Laura Del Rio Martin, Flavio Vella, Giulia Guidi
TL;DR
Kernel K-means enables non-linear clustering, but its $O(n^2 d)$ preprocessing cost and $O(n^2)$ per-iteration cost limit scalability. The paper introduces a matrix-centric formulation that rewrites the computation in terms of sparse-dense operations (SpMM and SpMV), enabling an efficient GPU implementation named Popcorn that avoids explicit high-dimensional projections. Popcorn achieves speedups of up to $123.8\times$ over a CPU implementation and up to $2.6\times$ over a dense GPU baseline, supporting sparse linear algebra as a practical tool for high-performance clustering. These gains rest on a GPU-centric design and on a dynamic strategy that selects between GEMM and SYRK when constructing the kernel matrix. Popcorn is publicly available, and the authors plan distributed extensions.
Abstract
K-means is a popular clustering algorithm with significant applications in numerous scientific and engineering areas. One drawback of K-means is its inability to identify non-linearly separable clusters, which may lead to inaccurate solutions in certain cases. Kernel K-means is a variant of classical K-means that can find non-linearly separable clusters. However, it scales quadratically with respect to the size of the dataset, taking several minutes to cluster even medium-sized datasets on traditional CPU-based machines. In this paper, we present a formulation of Kernel K-means using sparse-dense matrix multiplication (SpMM) and sparse matrix-vector multiplication (SpMV), and we show that our formulation enables the rapid implementation of a fast GPU-based version of Kernel K-means with little programming effort. Our implementation, named Popcorn, is the first open-source GPU-based implementation of Kernel K-means. Popcorn achieves a speedup of up to 123.8x over a CPU implementation of Kernel K-means and a speedup of up to 2.6x over a GPU implementation of Kernel K-means that does not use sparse matrix computations. Our results support the effectiveness of sparse matrices as tools for efficient parallel programming.
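To make the SpMM/SpMV formulation concrete, the following is a minimal CPU sketch under stated assumptions, not Popcorn's actual code. It assumes a Gaussian kernel, uses SciPy's CSR format in place of a GPU sparse library, and all function names (`gaussian_kernel`, `selection_matrix`, `centroid_distances`, `kernel_kmeans`) are hypothetical. Cluster membership is stored in a sparse $k \times n$ matrix $V$ with $V_{ji} = 1/|C_j|$ when point $i$ belongs to cluster $j$, so the feature-space distances follow from one SpMM and one SpMV against the kernel matrix $K$.

```python
import numpy as np
import scipy.sparse as sp


def gaussian_kernel(X, gamma=1.0):
    """Dense n x n Gaussian kernel matrix (the kernel choice is an assumption)."""
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] - 2.0 * (X @ X.T) + sq[None, :]
    return np.exp(-gamma * np.maximum(d2, 0.0))


def selection_matrix(labels, k):
    """Sparse k x n matrix V with V[j, i] = 1/|C_j| if point i is in cluster j."""
    n = labels.shape[0]
    counts = np.bincount(labels, minlength=k).astype(float)
    vals = 1.0 / counts[labels]  # only non-empty clusters are indexed here
    return sp.csr_matrix((vals, (labels, np.arange(n))), shape=(k, n))


def centroid_distances(K, labels, k):
    """Squared feature-space distances D[j, i] from point i to centroid j."""
    n = K.shape[0]
    V = selection_matrix(labels, k)
    E = V @ K                        # SpMM: sparse (k x n) times dense (n x n)
    z = E[labels, np.arange(n)]      # z[i] = E[labels[i], i]
    c = V @ z                        # SpMV: squared norm of each centroid
    D = np.diag(K)[None, :] - 2.0 * E + c[:, None]
    # Empty clusters have an all-zero row in V; never assign points to them.
    D[np.bincount(labels, minlength=k) == 0, :] = np.inf
    return D


def kernel_kmeans(K, k, labels, max_iter=100):
    """Iterate assignments until they stop changing or max_iter is reached."""
    labels = np.asarray(labels)
    for _ in range(max_iter):
        new_labels = np.argmin(centroid_distances(K, labels, k), axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels


if __name__ == "__main__":
    # Two well-separated 1-D groups; one point starts in the wrong cluster.
    X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
    K = gaussian_kernel(X, gamma=0.5)
    print(kernel_kmeans(K, 2, np.array([0, 0, 1, 1, 1, 1])))
```

Because $V$ has exactly one nonzero per column, the product `V @ K` touches each kernel entry once, and the centroid norms reuse its output instead of forming $V K V^\top$ explicitly. The dense $n \times n$ kernel matrix still dominates memory, which is consistent with the quadratic scaling noted in the abstract.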
