Flash-SD-KDE: Accelerating SD-KDE with Tensor Cores
Elliot L. Epstein, Rajat Vadiraj Dwaraknath, John Winnicki
TL;DR
The paper tackles the computational bottleneck of score-debiased KDE (SD-KDE), which improves the bias and mean integrated squared error (MISE) rates of classical KDE but requires an empirical score whose cost is quadratic in the number of samples. It introduces Flash-SD-KDE, a hardware-aware reformulation that reorders the SD-KDE computation into Tensor Core–friendly GEMMs with streaming accumulation, making 16-D density estimation efficient on GPUs. It also presents a Laplace-corrected KDE variant (Flash-Laplace-KDE) that preserves the leading-order bias reduction without requiring the empirical score, and shows that fused kernels reclaim memory bandwidth and yield further speedups. On 16-D benchmarks, Flash-SD-KDE runs up to $47\times$ faster than a strong SD-KDE GPU baseline, and it scales to roughly $10^6$ training points and roughly $10^5$ queries on a single GPU, making score-debiased density estimation practical at that size. The work highlights how hardware-aware reformulations can make nonparametric estimators practical and outlines directions for multi-GPU extensions and nonnegativity-preserving variants.
Abstract
Score-debiased kernel density estimation (SD-KDE) achieves improved asymptotic convergence rates over classical KDE, but its use of an empirical score has made it significantly slower in practice. We show that by re-ordering the SD-KDE computation to expose matrix-multiplication structure, Tensor Cores can be used to accelerate the GPU implementation. On a 32k-sample 16-dimensional problem, our approach runs up to $47\times$ faster than a strong SD-KDE GPU baseline and $3{,}300\times$ faster than scikit-learn's KDE. On a larger 1M-sample 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in $2.3$ s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.
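To make the "matrix-multiplication structure" concrete, the following is a minimal NumPy sketch, not the authors' GPU kernel. It assumes a Gaussian kernel, a single bandwidth `h` for both the score and the final density, and a debiasing shift of $\tfrac{h^2}{2}$ times the empirical score; the function names and the `block` parameter are hypothetical. The squared distance is expanded as $\|q - x\|^2 = \|q\|^2 + \|x\|^2 - 2\,q^\top x$, so the cross term becomes one matrix product, and the score numerator $\sum_j w_j x_j$ becomes a second one. Training points are consumed in blocks so the full pairwise weight matrix is never stored.

```python
import numpy as np

def gaussian_weights(queries, points, h):
    """Unnormalized Gaussian kernel weights exp(-||q - x||^2 / (2 h^2))."""
    q2 = np.sum(queries ** 2, axis=1)[:, None]
    x2 = np.sum(points ** 2, axis=1)[None, :]
    # The cross term is a single matrix product (the GEMM-shaped step).
    sq = np.maximum(q2 + x2 - 2.0 * (queries @ points.T), 0.0)
    return np.exp(-sq / (2.0 * h * h))

def empirical_score(queries, points, h, block=1024):
    """Gradient of the log of a Gaussian KDE, accumulated block by block."""
    num = np.zeros_like(queries, dtype=float)
    den = np.zeros(len(queries))
    for start in range(0, len(points), block):
        xb = points[start:start + block]
        w = gaussian_weights(queries, xb, h)
        num += w @ xb          # second matrix product: sum_j w_j x_j
        den += w.sum(axis=1)   # running sum of weights
    return (num / den[:, None] - queries) / (h * h)

def sd_kde(queries, train, h, block=1024):
    """Sketch of SD-KDE: shift samples along the score, then run a plain KDE.

    Assumption: shift size h^2 / 2 and one shared bandwidth h.
    """
    n, d = train.shape
    shifted = train + 0.5 * h * h * empirical_score(train, train, h, block)
    total = np.zeros(len(queries))
    for start in range(0, n, block):
        total += gaussian_weights(queries, shifted[start:start + block], h).sum(axis=1)
    return total / (n * (2.0 * np.pi * h * h) ** (d / 2.0))
```

Calling `sd_kde(queries, train, h)` with arrays of shape `(m, d)` and `(n, d)` returns `m` density values. On a GPU, the two matrix products are the parts that would map onto Tensor Cores, and the per-block running sums play the role of the streaming accumulation.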
