Flash-SD-KDE: Accelerating SD-KDE with Tensor Cores

Elliot L. Epstein; Rajat Vadiraj Dwaraknath; John Winnicki

TL;DR

The paper tackles the computational bottleneck of score-debiased KDE (SD-KDE), which improves on the bias and MISE rates of classical KDE but requires an empirical score whose cost is quadratic in the number of samples. It introduces Flash-SD-KDE, a hardware-aware reformulation that reorders SD-KDE into Tensor Core–friendly GEMMs with streaming accumulation, enabling efficient 16-D density estimation on GPUs. It also presents a Laplace-corrected KDE variant (Flash-Laplace-KDE) that preserves the leading-order bias reduction without requiring the empirical score, and shows that fusing the computation into a single kernel reclaims memory bandwidth and yields strong speedups. On 16-D benchmarks, Flash-SD-KDE runs up to $47\times$ faster than a strong SD-KDE GPU baseline, and it scales to roughly $10^6$ training points and roughly $10^5$ queries on a single GPU, making score-debiased density estimation practical at that scale. The work highlights how hardware-aware reformulations can unlock practical nonparametric estimators and outlines directions for multi-GPU extensions and nonnegativity-preserving variants.
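To make the Laplace-corrected idea concrete, here is a minimal NumPy sketch. It assumes a Gaussian kernel and assumes the correction takes the standard form $\hat p(x) - \tfrac{h^2}{2}\Delta \hat p(x)$, which for a Gaussian kernel folds into a single per-pair weight $1 + d/2 - r^2/(2h^2)$. The function name `laplace_corrected_kde` is hypothetical, and this is not the paper's fused GPU kernel; it only illustrates why one pass over the pairwise distances can suffice and why the estimate can dip slightly below zero.

```python
import numpy as np

def laplace_corrected_kde(queries, data, h):
    """Gaussian KDE minus (h^2 / 2) times its Laplacian, in one pass.

    Assumed form (not taken from the paper): for a Gaussian kernel the
    Laplacian of each kernel term is K_h(r) * (r^2 / h^4 - d / h^2), so the
    corrected estimator reweights each pair by 1 + d/2 - r^2 / (2 h^2).
    """
    n, d = data.shape
    # Pairwise squared distances from one matrix product (GEMM).
    q2 = np.sum(queries * queries, axis=1)[:, None]
    x2 = np.sum(data * data, axis=1)[None, :]
    r2 = np.maximum(q2 + x2 - 2.0 * (queries @ data.T), 0.0)
    u = r2 / (2.0 * h * h)
    # Kernel value and correction factor applied together, so no separate
    # pass over the (queries x data) matrix is needed.
    weights = np.exp(-u) * (1.0 + 0.5 * d - u)
    norm = n * (2.0 * np.pi * h * h) ** (d / 2.0)
    return weights.sum(axis=1) / norm
```

Under these assumptions the effective kernel still integrates to one but has zero second moment, which is what removes the leading bias term; it also turns negative once $r^2 > (d + 2)h^2$, so the resulting estimate is not guaranteed to be nonnegative.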

Abstract

Score-debiased kernel density estimation (SD-KDE) achieves improved asymptotic convergence rates over classical KDE, but its use of an empirical score has made it significantly slower in practice. We show that by re-ordering the SD-KDE computation to expose matrix-multiplication structure, Tensor Cores can be used to accelerate the GPU implementation. On a 32k-sample 16-dimensional problem, our approach runs up to $47\times$ faster than a strong SD-KDE GPU baseline and $3{,}300\times$ faster than scikit-learn's KDE. On a larger 1M-sample 16-dimensional task evaluated on 131k queries, Flash-SD-KDE completes in $2.3$ s on a single GPU, making score-debiased density estimation practical at previously infeasible scales.
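The matrix-multiplication structure mentioned above can be sketched in NumPy. This is an illustrative CPU sketch, not the paper's Tensor Core implementation. It assumes a Gaussian kernel, the same bandwidth $h$ for the score and the final KDE, and a shift of $h^2/2$ along the empirical score; all function names are hypothetical. Two matrix products carry the work: one for the pairwise distances, via $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2x^\top y$, and one for the kernel-weighted sum of samples.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances using one matrix product (GEMM)."""
    a2 = np.sum(a * a, axis=1)[:, None]   # shape (m, 1)
    b2 = np.sum(b * b, axis=1)[None, :]   # shape (1, n)
    d2 = a2 + b2 - 2.0 * (a @ b.T)        # shape (m, n)
    return np.maximum(d2, 0.0)            # clip small negatives from rounding

def gaussian_kde(queries, data, h):
    """Plain Gaussian KDE evaluated at the query points."""
    n, d = data.shape
    w = np.exp(-pairwise_sq_dists(queries, data) / (2.0 * h * h))
    return w.sum(axis=1) / (n * (2.0 * np.pi * h * h) ** (d / 2.0))

def empirical_score(points, data, h):
    """Gradient of the log of the Gaussian KDE at each point.

    The numerator sum_j w_ij * x_j is a second GEMM (w @ data).
    """
    w = np.exp(-pairwise_sq_dists(points, data) / (2.0 * h * h))
    denom = w.sum(axis=1, keepdims=True)
    return ((w @ data) / denom - points) / (h * h)

def sd_kde(queries, data, h):
    """Assumed SD-KDE form: shift samples by (h^2 / 2) * score, then run KDE."""
    shifted = data + 0.5 * h * h * empirical_score(data, data, h)
    return gaussian_kde(queries, shifted, h)
```

A call such as `sd_kde(queries, data, h=1.0)` with `data` of shape `(n, 16)` returns one density value per query. This sketch materializes the full weight matrix, whereas a streaming implementation would process it in tiles and accumulate the row sums as it goes.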

Paper Structure (22 sections, 29 equations, 7 figures, 1 table)

This paper contains 22 sections, 29 equations, 7 figures, 1 table.

Introduction
Related Work
Kernel density estimation.
Bias reduction and adaptive smoothing.
Score estimation and score-based modeling.
Fast evaluation of kernel sums and density models.
GPU acceleration and tensor-core programming.
Hardware
Method
Arithmetic intensity in $d$ dimensions
Total FLOPs.
Bytes moved.
Arithmetic intensity.
Laplace-corrected KDE
Connection to SD-KDE.
...and 7 more sections

Figures (7)

Figure 1: Runtime comparison for 16-D KDE/SD-KDE across $n_{\text{train}}$ up to $32{,}768$ ($n_{\text{test}} = n_{\text{train}}/8$).
Figure 2: Oracle error on a 16D mixture-of-Gaussians benchmark. We report MISE and MIAE versus $n_{\text{train}}$ for KDE, Flash-Laplace-KDE (fused Laplace correction), non-fused Laplace correction, and Flash-SD-KDE. The Laplace-corrected estimators can be slightly negative, so error is computed on the signed density.
Figure 3: Oracle error on a 1D mixture-of-Gaussians benchmark. We report MISE and MIAE versus $n_{\text{train}}$ for KDE, Flash-Laplace-KDE (fused Laplace correction), non-fused Laplace correction, and Flash-SD-KDE. The Laplace-corrected estimators can be slightly negative, so error is computed on the signed density.
Figure 4: Runtime and speedup for Laplace correction in 1D. The left panel shows total runtime for the fused Flash-Laplace-KDE kernel and a non-fused implementation. The right panel reports speedup ratios, including Flash-SD-KDE relative to Flash-Laplace-KDE for context.
Figure 5: Utilization (percentage of RTX A6000 Tensor Core peak) for the 16-D SD-KDE pipeline, computed via the FLOP model from Section \ref{sec:method}; bars are annotated with the observed runtimes (ms).
...and 2 more figures
