Neural Empirical Bayes

Saeed Saremi, Aapo Hyvarinen

TL;DR

This work unifies kernel density estimation and empirical Bayes within a high-dimensional, concentration-of-measure framework, introducing a neural energy φ to approximate the score function and thereby enable end-to-end learning without explicit nonparametric density estimation. It develops NEBULA, a Hopfield-like associative memory driven by the gradient of φ, and a walk-jump sampling scheme that pairs Langevin dynamics with Robbins-style jumps to sample from the smoothed distribution and steer samples toward latent components. The paper analyzes manifold disintegration-expansion under Gaussian smoothing, introduces i-sphere interactions as a geometric mechanism for learning and memory, and demonstrates novel phenomena such as creative memories arising from highly overlapping spheres. Taken together, the approach provides a scalable, geometry-aware method for unsupervised learning, sampling, and memory-like computation in high dimensions.

Abstract

We unify $\textit{kernel density estimation}$ and $\textit{empirical Bayes}$ and address a set of problems in unsupervised learning with a geometric interpretation of those methods, rooted in the $\textit{concentration of measure}$ phenomenon. Kernel density is viewed symbolically as $X\rightharpoonup Y$ where the random variable $X$ is smoothed to $Y= X+N(0,\sigma^2 I_d)$, and empirical Bayes is the machinery to denoise in a least-squares sense, which we express as $X \leftharpoondown Y$. A learning objective is derived by combining these two, symbolically captured by $X \rightleftharpoons Y$. Crucially, instead of using the original nonparametric estimators, we parametrize $\textit{the energy function}$ with a neural network denoted by $\phi$; at optimality, $\nabla \phi\approx -\nabla \log f$ where $f$ is the density of $Y$. The optimization problem is abstracted as interactions of high-dimensional spheres which emerge due to the concentration of isotropic gaussians. We introduce two algorithmic frameworks based on this machinery: (i) a "walk-jump" sampling scheme that combines Langevin MCMC (walks) and empirical Bayes (jumps), and (ii) a probabilistic framework for $\textit{associative memory}$, called NEBULA, defined à la Hopfield by the $\textit{gradient flow}$ of the learned energy to a set of attractors. We finish the paper by reporting the emergence of very rich "creative memories" as attractors of NEBULA for highly-overlapping spheres.
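
As a worked statement of the denoising step $X \leftharpoondown Y$ and the combined objective $X \rightleftharpoons Y$, the following is a sketch that assumes the least-squares estimator takes the standard Gaussian empirical Bayes form. The symbols $\widehat{x}$ and $\mathcal{L}$ and the parameter vector $\theta$ of the network are notation introduced here for illustration.

```latex
% Least-squares denoiser for Y = X + N(0, \sigma^2 I_d), with f the density of Y:
\widehat{x}(y) \;=\; \mathbb{E}[X \mid Y = y] \;=\; y + \sigma^2 \nabla \log f(y)

% Replacing -\nabla \log f by the gradient of the neural energy \phi_\theta:
\widehat{x}_\theta(y) \;=\; y - \sigma^2 \nabla \phi_\theta(y)

% Learning objective: expected squared denoising error over pairs (x, y):
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(x,y)} \bigl\Vert x - y + \sigma^2 \nabla \phi_\theta(y) \bigr\Vert^2
```

Because the conditional mean is the least-squares optimum, a sufficiently expressive $\phi_\theta$ minimizing $\mathcal{L}$ satisfies $\nabla \phi_\theta \approx -\nabla \log f$, which matches the optimality condition stated in the abstract.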

Paper Structure

This paper contains 9 sections, 36 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: (a) Samples from a 2D isotropic gaussian, obtained and rendered in the programming language Processing. (b) Schematic of an isotropic gaussian in high dimensions, where the concentration of norm is illustrated. (c) Schematic of the $i$-sphere, with samples $Y_{ij}=X_i+\varepsilon_j,~\varepsilon_j \sim N(0,\sigma^2 I_d)$. The arrows represent $-\nabla \phi$, evaluated on the sphere. The learning objective is encapsulated by $X \rightleftharpoons Y$, where the squared $\ell_2$ norm $\Vert X_i - \widehat{x}(Y_{ij})\Vert^2$ is the learning signal and minimized in expectation. Ignoring the other spheres, the learning objective is constructed such that $-\nabla \phi$ evaluated at $Y_{ij}$ points to $X_i$.
  • Figure 2: (overlapping $i$-spheres) The extent of the overlap between $i$-sphere and $i'$-sphere is tuned by $\sigma$ in relation to $\chi_{ii'}=\Vert X_i-X_{i'}\Vert/(2\sqrt{d}).$ The scaling ($2\sqrt{d}$) is due to the fact that $N(0,\sigma^2 I_d) \approx \text{Unif}(\sigma \sqrt{d} S^{d-1})$ in high dimensions.
  • Figure 3: ($\sigma \approx \sigma_c$) Denoising performance of DEEN with a single jump for $\sigma=0.3$. The noisy pixel values are in the range $\tt [-1.200, 1.995]$, and the denoised ones are in $\tt [-0.0749,1.0539]$.
  • Figure 4: ($\sigma > 2 \sigma_c$) Here, $\sigma=0.7$. The noisy pixel values are in the range $\tt [-3.314 , 3.683]$, and the denoised ones are in $\tt [-0.2405, 1.2686]$. In this regime, the whole database is inside each $i$-sphere (see Figure \ref{fig:twospheres}c).
  • Figure 5: ($\sigma\approx \sigma_c$) Top row is $x_0\sim f_X$, sampled from the handwritten digit database on which DEEN was trained. The Langevin MCMC was initialized at $y_0 = x_0 +\varepsilon$, $\varepsilon \sim N(0,\sigma^2 I_d)$, with the same value of $\sigma$ that DEEN had been trained on. The samples $y_t$ are not shown. The jumps are shown in multiples of $\tau_0 = 10^4$, and the step size was $\delta = \sigma/100.$ Here, $\sigma=0.3$. The pixel values are in the range [-0.07, 1.10].
  • ...and 5 more figures
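
The walk-jump procedure described in the Figure 5 caption can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the closed-form score of a Gaussian kernel density estimate stands in for the learned $-\nabla \phi$ (the paper parametrizes the energy with a neural network instead), the walk is a standard unadjusted Langevin update with step $\delta^2$ (an assumed discretization), and the names `kde_score`, `jump`, and `walk_jump` are hypothetical. The jump uses the Gaussian empirical Bayes estimator $\widehat{x}(y) = y + \sigma^2 \nabla \log f(y)$.

```python
import numpy as np

def kde_score(y, centers, sigma):
    """grad log f(y) for f = (1/n) sum_i N(X_i, sigma^2 I); stand-in for -grad phi."""
    diffs = centers - y                                  # rows are X_i - y, shape (n, d)
    logits = -np.sum(diffs**2, axis=1) / (2 * sigma**2)
    w = np.exp(logits - logits.max())                    # numerically stable softmax weights
    w /= w.sum()
    return (w @ diffs) / sigma**2

def jump(y, centers, sigma):
    """Empirical Bayes jump: least-squares estimate of X given Y = y."""
    return y + sigma**2 * kde_score(y, centers, sigma)

def walk_jump(y0, centers, sigma, n_steps, delta, rng):
    """Langevin walk in y-space, followed by a single jump back to x-space."""
    y = y0.copy()
    for _ in range(n_steps):
        noise = rng.standard_normal(y.shape)
        # Unadjusted Langevin step (assumed form): drift delta^2 * score, noise sqrt(2) * delta.
        y = y + delta**2 * kde_score(y, centers, sigma) + np.sqrt(2) * delta * noise
    return y, jump(y, centers, sigma)

# Toy usage: 5 random "database" points in d = 64, with sigma and delta as in the caption.
rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(5, 64))
sigma = 0.3
y0 = centers[0] + sigma * rng.standard_normal(64)       # y_0 = x_0 + eps
y_t, x_hat = walk_jump(y0, centers, sigma, n_steps=1000, delta=sigma / 100, rng=rng)
print(np.linalg.norm(x_hat - centers, axis=1))          # distance of the jump to each X_i
```

With the kernel density score, each jump is a softmax-weighted average of the database points, so the walk explores the smoothed density of $Y$ while the jump maps the current $y_t$ back toward the data.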

Theorems & Definitions (9)

  • Remark 1: $X$ and $Y$
  • Remark 2: DEEN
  • Definition 3: $i$-sphere
  • Remark 4
  • Definition 5: $i$-sphere interactions
  • Remark 6
  • Remark 7: implicit vs. explicit parameterization
  • Remark 8
  • Definition 9
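
The $i$-sphere of Definition 3, and the $2\sqrt{d}$ scaling in the Figure 2 caption, rest on the concentration of $\Vert \varepsilon \Vert$ near $\sigma\sqrt{d}$ for $\varepsilon \sim N(0,\sigma^2 I_d)$. The following is a quick numerical check, offered as a sketch: the function names `norm_concentration` and `chi` and the chosen dimensions are illustrative assumptions. Geometrically, two spheres of radius $\sigma\sqrt{d}$ centered at $X_i$ and $X_{i'}$ intersect when $\Vert X_i - X_{i'}\Vert < 2\sigma\sqrt{d}$, that is, when $\sigma > \chi_{ii'}$.

```python
import numpy as np

def norm_concentration(d, sigma, n_samples, rng):
    """Mean and std of ||eps|| / (sigma * sqrt(d)) for eps ~ N(0, sigma^2 I_d)."""
    eps = sigma * rng.standard_normal((n_samples, d))
    r = np.linalg.norm(eps, axis=1) / (sigma * np.sqrt(d))
    return r.mean(), r.std()

def chi(x_i, x_j):
    """chi_{ii'} = ||X_i - X_i'|| / (2 sqrt(d)), as in the Figure 2 caption."""
    d = x_i.shape[0]
    return np.linalg.norm(x_i - x_j) / (2 * np.sqrt(d))

rng = np.random.default_rng(0)
for d in (2, 64, 4096):
    mean, std = norm_concentration(d, sigma=0.3, n_samples=500, rng=rng)
    # The relative spread of the norm shrinks as d grows.
    print(f"d={d:5d}  mean ratio={mean:.4f}  std={std:.4f}")

# Overlap check for two hypothetical points: their spheres intersect when sigma > chi.
x_a, x_b = rng.uniform(size=4096), rng.uniform(size=4096)
print("chi =", chi(x_a, x_b), " spheres intersect at sigma=0.3:", 0.3 > chi(x_a, x_b))
```

The ratio $\Vert \varepsilon \Vert / (\sigma\sqrt{d})$ approaches 1 with a spread that shrinks roughly like $1/\sqrt{2d}$, which is why the noisy samples $Y_{ij}$ can be pictured as lying on a thin shell around $X_i$.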