From Kernels to Attention: A Transformer Framework for Density and Score Estimation

Vasily Ilin; Peter Sushko

TL;DR

We address the problem of jointly estimating a density $f(x)$ and its score $\nabla_x \log f(x)$ from i.i.d. samples. The method uses a permutation- and affine-equivariant Transformer with cross-attention to query arbitrary points and jointly predict $f$ and $\nabla_x \log f$. It is trained on synthetic Gaussian mixtures to generalize across sample size $n$ and dimension $d$. A key theoretical result shows that self-attention can recover normalized kernel density estimation (KDE) weights, linking classical kernel smoothing to the learned operator. Empirically, the model outperforms KDE and score-debiased KDE (SD-KDE) in accuracy and runtime and provides accurate plug-in estimates for entropy, Fisher information, and PDE solvers. Overall, the work unifies classical nonparametric density estimation with modern symmetry-preserving neural operators, delivering scalable, data-adaptive density and score estimation across varying sample sizes and dimensions.

Abstract

We introduce a unified attention-based framework for joint score and density estimation. Framing the problem as a sequence-to-sequence task, we develop a permutation- and affine-equivariant transformer that estimates both the probability density $f(x)$ and its score $\nabla_x \log f(x)$ directly from i.i.d. samples. Unlike traditional score-matching methods that require training a separate model for each distribution, our approach learns a single distribution-agnostic operator that generalizes across densities and sample sizes. The architecture employs cross-attention to connect observed samples with arbitrary query points, enabling generalization beyond the training data, while built-in symmetry constraints ensure equivariance to permutation and affine transformations. Analytically, we show that the attention weights can recover classical kernel density estimation (KDE), and we verify this empirically, establishing a principled link between classical KDE and the transformer architecture. Empirically, the model achieves substantially lower error than KDE and score-debiased KDE (SD-KDE) and scales better, including in runtime. Together, these results establish transformers as general-purpose, data-adaptive operators for nonparametric density and score estimation.
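The link between attention weights and KDE can be illustrated with a small numerical sketch. It is not the paper's architecture or proof. It only checks a standard identity for the Gaussian kernel: softmax over dot-product logits $q \cdot x_i / h^2$ plus a per-key bias $-\|x_i\|^2 / (2h^2)$ equals the normalized kernel weights, because the missing $-\|q\|^2 / (2h^2)$ term is constant across keys and cancels in the softmax. The function names, the bandwidth `h`, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_weights(queries, samples, h):
    """Normalized Gaussian kernel weights, w[q, i] proportional to
    exp(-||x_q - x_i||^2 / (2 h^2)), with each row summing to 1."""
    sq_dists = ((queries[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    logits = -sq_dists / (2.0 * h**2)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def attention_weights(queries, samples, h):
    """Softmax attention with dot-product logits and a per-key bias.
    Hypothetical construction: queries and keys are the raw points scaled by 1/h."""
    q = queries / h
    k = samples / h
    key_bias = -0.5 * (k ** 2).sum(-1)       # shape (n,), one bias per key
    logits = q @ k.T + key_bias[None, :]     # shape (m, n)
    logits = logits - logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    return a / a.sum(axis=1, keepdims=True)

def kde_score(queries, samples, h):
    """Score of the Gaussian KDE, read off the attention output:
    grad log f_hat(x) = (sum_i w_i x_i - x) / h^2."""
    w = attention_weights(queries, samples, h)
    return (w @ samples - queries) / h**2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(200, 2))   # n = 200 samples in d = 2
    queries = rng.normal(size=(5, 2))     # 5 arbitrary query points
    same = np.allclose(gaussian_kde_weights(queries, samples, 0.5),
                       attention_weights(queries, samples, 0.5))
    print("attention weights match KDE weights:", same)
    print("KDE score at queries:\n", kde_score(queries, samples, 0.5))
```

In this sketch the attention output with the samples as values is the kernel-weighted mean of the samples, so the score of the Gaussian KDE follows from a single attention readout. A learned model is free to depart from this fixed-bandwidth solution, which is one way to read the "data-adaptive" claim above.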
