From Kernels to Attention: A Transformer Framework for Density and Score Estimation

Vasily Ilin, Peter Sushko

TL;DR

We address the problem of jointly estimating a density $f(x)$ and its score $\nabla_x \log f(x)$ from i.i.d. samples. The method uses a permutation- and affine-equivariant Transformer with cross-attention to query arbitrary points and jointly predict $f$ and $\nabla \log f$, trained on synthetic Gaussian mixtures to generalize across $n$ and $d$. A key theoretical result shows that self-attention can recover normalized KDE weights, linking classical kernel smoothing to the learned operator, while empirically the model outperforms KDE and SD-KDE in accuracy and runtime and provides accurate plug-ins for entropy, Fisher information, and PDE solvers. Overall, the work unifies classical nonparametric density estimation with modern symmetry-preserving neural operators, delivering scalable, data-adaptive density and score estimation across varying sample sizes and dimensions.

Abstract

We introduce a unified attention-based framework for joint score and density estimation. Framing the problem as a sequence-to-sequence task, we develop a permutation- and affine-equivariant transformer that estimates both the probability density $f(x)$ and its score $\nabla_x \log f(x)$ directly from i.i.d. samples. Unlike traditional score-matching methods that require training a separate model for each distribution, our approach learns a single distribution-agnostic operator that generalizes across densities and sample sizes. The architecture employs cross-attention to connect observed samples with arbitrary query points, enabling generalization beyond the training data, while built-in symmetry constraints ensure equivariance to permutation and affine transformations. Analytically, we show that the attention weights can recover classical kernel density estimation (KDE), and we verify this empirically, establishing a principled link between classical KDE and the transformer architecture. Empirically, the model achieves substantially lower error than KDE and score-debiased KDE (SD-KDE), with better error scaling in sample size and dimension and better runtime scaling. Together, these results establish transformers as general-purpose, data-adaptive operators for nonparametric density and score estimation.
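The link between attention and KDE can be illustrated numerically. The sketch below is not the paper's construction: the function names, the bandwidth `h`, and the particular choice of queries and keys are assumptions made for illustration. It relies only on the fact that softmax is invariant to adding a per-row constant, so dot-product logits $x_i \cdot x_j / h^2 - \|x_j\|^2 / (2h^2)$ give the same row-normalized weights as the Gaussian kernel $e^{-\|x_i - x_j\|^2 / (2h^2)}$.

```python
import numpy as np

def softmax_rows(z):
    # Numerically stable softmax applied independently to each row.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def normalized_kde_weights(X, h):
    # Row-normalized Gaussian kernel matrix: D[i, j] ∝ exp(-||x_i - x_j||^2 / (2 h^2)).
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dist / (2 * h**2))
    return K / K.sum(axis=1, keepdims=True)

def attention_weights(X, h):
    # Hypothetical hand-set queries and keys (an assumption, not learned weights):
    #   q_i = [x_i / h^2, 1],  k_j = [x_j, -||x_j||^2 / (2 h^2)]
    # so q_i . k_j equals -||x_i - x_j||^2 / (2 h^2) up to a term constant in j.
    n, _ = X.shape
    Q = np.hstack([X / h**2, np.ones((n, 1))])
    K = np.hstack([X, -0.5 * (X**2).sum(axis=1, keepdims=True) / h**2])
    return softmax_rows(Q @ K.T)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))   # 40 samples in d = 3
A = attention_weights(X, h=0.7)
D = normalized_kde_weights(X, h=0.7)
print(np.abs(A - D).max())     # maximum entrywise gap between the two matrices
```

With $2h^2 = 1$ the kernel matches the form $D_{i,j} \propto e^{-\|x_i-x_j\|^2_2}$ used in the Figure 2 caption.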

Paper Structure

This paper contains 22 sections, 4 theorems, 35 equations, 14 figures, 1 table, 1 algorithm.

Key Result

Proposition 3.1

Let $f$ be a differentiable density, and let $X=(x_1,\dots,x_n)^T$ be an i.i.d. sample from $f$. Define … Let $P\in\mathbb{R}^{n\times n}$ be a permutation matrix, $A\in\mathbb{R}^{d\times d}$ be invertible, and $\mu\in\mathbb{R}^d$. Then …
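For orientation, an equivariance statement of this kind rests on the classical change-of-variables identities for an invertible affine map. The notation below ($\pi$ for the permutation encoded by $P$, $y_i$ for the transformed samples, $f_Y$ for their density) is introduced here for illustration and is not taken from the source, so the proposition's exact formulation may differ:

$$
y_i = A\,x_{\pi(i)} + \mu, \qquad
f_Y(y) = \frac{f\big(A^{-1}(y-\mu)\big)}{|\det A|}, \qquad
\nabla_y \log f_Y(y) = A^{-T}\,\nabla_x \log f(x)\Big|_{x = A^{-1}(y-\mu)}.
$$

The second identity is the standard density transformation rule. The third follows by taking the gradient of $\log f_Y(y) = \log f\big(A^{-1}(y-\mu)\big) - \log|\det A|$ with the chain rule. Permuting the samples only reorders the evaluations.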

Figures (14)

  • Figure 1: The forward pass implementing affine equivariance.
  • Figure 2: Attention visualization of the transformer model. The top panel shows the average attention scores in layer 0 with respect to the chosen point x. The heatmaps show the attention matrix and the normalized KDE matrix $D_{i,j} \propto e^{-\|x_i-x_j\|^2_2}$. The scatter plots show very high agreement between attention scores and $D$. The bottom four panels visualize the attention scores of individual heads, demonstrating emergent head specialization.
  • Figure 3: Score estimation comparison between Silverman KDE (Silverman, 2018) and our transformer model. The transformer is more accurate, especially in the sparse regions. We plot the negated score for easier viewing.
  • Figure 4: Left: MSE of score estimation in dimensions 1 and 10 on a 3-modal GMM using the Transformer and KDE. The Transformer scales well in both the dimension $d$ and the sample size $n$. Right: MSE of score estimation using the Transformer on GMMs with different numbers of modes. Despite being trained only on GMMs with 1 to 10 modes and $n=2048$, the model generalizes well.
  • Figure 5: Computing relative Fisher information. Our model predicts $\nabla \log g(x_i)$ at query points $x_i$ via cross-attention with the samples $y_i\sim g$. These predicted scores can be used to estimate the relative Fisher information.
  • ...and 9 more figures
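
The KDE baseline from Figure 3 and the plug-in use of scores from Figure 5 can be sketched together. This is a minimal illustration under stated assumptions, not the authors' model or code: a fixed-bandwidth Gaussian KDE score stands in for the transformer's predicted scores, the bandwidth `h` is set by hand rather than by Silverman's rule, and the function names are hypothetical. The relative Fisher information is taken in its standard form $\mathbb{E}_{x\sim f}\,\|\nabla \log f(x) - \nabla \log g(x)\|^2$ and approximated by an average over the query points $x_i \sim f$.

```python
import numpy as np

def kde_score(queries, samples, h):
    # Score (gradient of the log-density) of a Gaussian KDE with bandwidth h,
    # evaluated at each query point:
    #   sum_i w_i(q) * (x_i - q) / h^2, with w_i the normalized kernel weights.
    diff = samples[None, :, :] - queries[:, None, :]      # (m, n, d): x_i - q
    logits = -(diff ** 2).sum(axis=-1) / (2 * h**2)       # (m, n)
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize the exponent
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * diff).sum(axis=1) / h**2      # (m, d)

def relative_fisher_information(score_f, score_g):
    # Monte Carlo plug-in: mean squared norm of the score difference,
    # where row k holds both scores evaluated at the same point x_k ~ f.
    return float(((score_f - score_g) ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))          # x_i ~ f = N(0, I)
y = rng.normal(size=(500, 2)) + 1.0    # y_i ~ g = N((1, 1), I)

score_f_at_x = kde_score(x, x, h=0.3)  # stand-in for predicted grad log f(x_i)
score_g_at_x = kde_score(x, y, h=0.3)  # queries x_i attend to the samples y_i
print(relative_fisher_information(score_f_at_x, score_g_at_x))
```

For these two Gaussians the exact scores are $-x$ and $-(x-m)$ with $m=(1,1)$, so the true relative Fisher information is $\|m\|^2 = 2$. The KDE-based estimate carries smoothing bias and sampling noise, which is the kind of error a more accurate score estimator would reduce.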

Theorems & Definitions (6)

  • Proposition 3.1: Permutation and affine equivariance of density and score evaluation
  • Proposition 3.2: Attention computes KDE
  • Proposition 1.1: Permutation and affine equivariance of density and score evaluation
  • proof
  • Proposition 1.2: Attention computes KDE
  • proof