Density Estimation on Rectifiable Sets

Jack Kendrick

TL;DR

This work extends kernel density estimation to data supported on $d$-rectifiable sets. Classical KDE converges slowly in high ambient dimensions; the estimator here, ${\hat{p}}_n(x) = \frac{1}{n h_n^d}\sum_{i=1}^n K(\|X_i-x\|/h_n)$, is instead scaled by the intrinsic dimension $d$. Under an approximate-tangent-space condition with parameter $m>0$, the estimator's mean-squared error decays as ${\rm MSE}[{\hat{p}}_n(x)] = O\left(\frac{1}{n^{2m/(d+2m)}}\right)$, recovering known results on manifolds and extending them to algebraic and semi-algebraic sets. When the support is locally a smooth manifold almost everywhere, and $p$ and $K$ are sufficiently smooth, the method achieves the classical rate ${\rm MSE} = O\left(\frac{1}{n^{4/(d+4)}}\right)$ with $h_n \asymp n^{-1/(d+4)}$, reflecting the better tangent-space approximation. A numerical example on $d$-sparse data shows that the convergence rate depends on the intrinsic dimension $d$ but not on the ambient dimension $D$, which makes the method applicable to high-dimensional data with low-dimensional structure, such as sparse or low-rank models.

Abstract

Kernel density estimation is a popular method for estimating unseen probability distributions. However, the convergence of these classical estimators to the true density slows down in high dimensions. Moreover, they do not define meaningful probability distributions when the intrinsic dimension of data is much smaller than its ambient dimension. We build on previous work on density estimation on manifolds to show that a modified kernel density estimator converges to the true density on $d$-rectifiable sets. As a special case, we consider algebraic varieties and semi-algebraic sets and prove a convergence rate in this setting. We conclude the paper with a numerical experiment illustrating the convergence of this estimator on sparse data.
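
The following is a minimal Python sketch of a kernel density estimator scaled by the intrinsic dimension $d$, applied to a toy $d$-sparse data set. It is not the paper's experiment: the sampling model (`sample_sparse`), the Epanechnikov-type kernel, the helper names, and the normalization by $1/(n h_n^d)$ are all assumptions made for illustration. The kernel is continuous, vanishes outside $[0, 1]$, and integrates to 1 over the unit ball of ${\mathbb R}^d$.

```python
import math
import numpy as np

def kernel(r, d):
    """Epanechnikov-type profile (an assumed choice): continuous, zero for r > 1,
    normalized to integrate to 1 over the unit ball of R^d."""
    unit_ball_volume = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    c = (d + 2) / (2 * unit_ball_volume)
    return np.where(r <= 1.0, c * (1.0 - r ** 2), 0.0)

def kde_intrinsic(x, samples, h, d):
    """p_hat_n(x) = 1/(n h^d) * sum_i K(||X_i - x|| / h).
    The bandwidth is raised to the intrinsic dimension d, not the ambient D."""
    r = np.linalg.norm(samples - x, axis=1) / h
    return kernel(r, d).sum() / (len(samples) * h ** d)

def sample_sparse(n, D, d, rng):
    """Hypothetical d-sparse model: a uniformly random support of size d
    in R^D, filled with independent standard normal entries."""
    support = rng.random((n, D)).argsort(axis=1)[:, :d]
    X = np.zeros((n, D))
    X[np.arange(n)[:, None], support] = rng.standard_normal((n, d))
    return X

def true_density(x, D, d):
    """Density of the toy model with respect to H^d.
    Valid only at points x with exactly d nonzero entries."""
    nz = x[x != 0]
    gauss = np.exp(-0.5 * nz @ nz) / (2 * math.pi) ** (d / 2)
    return gauss / math.comb(D, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, d, n = 10, 1, 20000
    X = sample_sparse(n, D, d, rng)
    x = np.zeros(D)
    x[0] = 0.5
    h = n ** (-1 / (d + 4))  # bandwidth scaling for the smooth case
    print("estimate:", kde_intrinsic(x, X, h, d))
    print("truth:   ", true_density(x, D, d))
```

Increasing `D` while holding `d` fixed changes the value of the true density in this toy model (through the number of possible supports), but the bandwidth and the normalization depend only on `d`.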

Paper Structure

This paper contains 5 sections, 7 theorems, 40 equations, 1 figure.

Key Result

Theorem 1.1

Let $\Omega\subset{\mathbb R}^D$ be a $d$-rectifiable set and $P$ a probability measure on $\Omega$ with density $p$ with respect to the $d$-dimensional Hausdorff measure ${\mathcal{H}}^d$. Assume that the kernel $K$ is continuous and vanishes outside of $[0, 1]$. Then, when the bandwidth parameters $h_n$ … holds with probability 1, where $m$ is a measure of how well $\Omega$ is approximated by its approx…
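
The rate quoted in the TL;DR can be read heuristically as a bias-variance balance. The decomposition below is an assumption made for illustration (the truncated statement above does not give it): a squared bias of order $h_n^{2m}$, a variance of order $1/(n h_n^d)$, and unspecified constants $C_1, C_2 > 0$.

$$
{\rm MSE}[{\hat{p}}_n(x)] \lesssim C_1 h_n^{2m} + \frac{C_2}{n h_n^d},
\qquad
h_n \asymp n^{-1/(d+2m)}
\;\Longrightarrow\;
{\rm MSE}[{\hat{p}}_n(x)] = O\left(\frac{1}{n^{2m/(d+2m)}}\right).
$$

Setting $m = 2$ gives $h_n \asymp n^{-1/(d+4)}$ and ${\rm MSE} = O\left(\frac{1}{n^{4/(d+4)}}\right)$, which agrees with the rate stated for supports that are smooth manifolds almost everywhere.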

Figures (1)

  • Figure 1: The empirical mean squared error of the predictor ${\hat{p}}_n$ in various ambient dimensions and using varying sizes of training sets. The shaded regions indicate a 95% confidence interval. Note that the curve corresponding to each ambient dimension $D$ has the same shape, illustrating that the convergence rate is independent of the ambient dimension and depends only on the intrinsic dimension $d$.

Theorems & Definitions (13)

  • Theorem 1.1: KDE on rectifiable sets
  • Theorem 1.2: KDE on a.e. smooth spaces
  • Definition 2.1: $d$-rectifiable
  • Definition 2.2: Approximate tangent space
  • Theorem 2.3: simon-gmt
  • Lemma 2.4
  • proof
  • Lemma 3.1
  • proof
  • Lemma 3.2
  • ...and 3 more