Learning normalized image densities via dual score matching

Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli

TL;DR

A new framework for learning normalized energy (log probability) models, inspired by diffusion generative models, which rely on networks optimized to estimate the score. Trained on ImageNet64, the energy model obtains a cross-entropy value comparable to the state of the art.

Abstract

Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning normalized energy (log probability) models that is inspired by diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this dual score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model strongly generalizes: log probabilities estimated with two networks trained on non-overlapping data subsets are nearly identical. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary substantially depending on image content, in contrast with conventional assumptions such as concentration of measure or support on a low-dimensional manifold.
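
The two scores mentioned in the abstract can be illustrated on a toy case where everything is known in closed form. The sketch below is not the paper's network or training objective. It assumes isotropic Gaussian data $x \sim \mathcal{N}(0, s^2 I)$ and a noise level $t$ that acts as a variance, $y = x + \sqrt{t}\,z$ (both are assumptions made for illustration, as are all function names). It checks that the gradients of the energy $U(y,t) = -\log p(y,t)$ with respect to $y$ and $t$ give the space and time scores, and that each equals the posterior average of a conditional target that depends only on $(x, y, t)$, which is what makes regression-style objectives possible.

```python
import math

def energy(y, t, s2):
    """U(y, t) = -log p(y, t) for y ~ N(0, (s2 + t) I). Toy model, not the paper's network."""
    d = len(y)
    var = s2 + t
    sq = sum(v * v for v in y)
    return sq / (2 * var) + 0.5 * d * math.log(2 * math.pi * var)

def space_score(y, t, s2):
    """Closed-form gradient of log p(y, t) with respect to y."""
    return [-v / (s2 + t) for v in y]

def time_score(y, t, s2):
    """Closed-form derivative of log p(y, t) with respect to t."""
    d = len(y)
    var = s2 + t
    sq = sum(v * v for v in y)
    return sq / (2 * var ** 2) - d / (2 * var)

def denoising_target(y, t, s2):
    """E[(x - y) / t | y]: posterior mean of the conditional space score."""
    post_mean = [s2 / (s2 + t) * v for v in y]
    return [(m - v) / t for m, v in zip(post_mean, y)]

def time_target(y, t, s2):
    """E[|y - x|^2 / (2 t^2) - d / (2 t) | y]: posterior mean of the conditional time score."""
    d = len(y)
    post_mean = [s2 / (s2 + t) * v for v in y]
    post_var = s2 * t / (s2 + t)  # per-coordinate posterior variance of x given y
    exp_sq = sum((v - m) ** 2 for v, m in zip(y, post_mean)) + d * post_var
    return exp_sq / (2 * t ** 2) - d / (2 * t)

def finite_diff_scores(y, t, s2, eps=1e-5):
    """Central differences of -U in each coordinate of y and in t."""
    grad = []
    for i in range(len(y)):
        up, dn = list(y), list(y)
        up[i] += eps
        dn[i] -= eps
        grad.append(-(energy(up, t, s2) - energy(dn, t, s2)) / (2 * eps))
    dt = -(energy(y, t + eps, s2) - energy(y, t - eps, s2)) / (2 * eps)
    return grad, dt

y, t, s2 = [0.3, -1.2, 0.7, 2.0], 0.5, 1.5  # arbitrary illustrative values
fd_grad, fd_dt = finite_diff_scores(y, t, s2)
print("space score error:", max(abs(a - b) for a, b in zip(fd_grad, space_score(y, t, s2))))
print("time score error:", abs(fd_dt - time_score(y, t, s2)))
print("denoising target error:",
      max(abs(a - b) for a, b in zip(denoising_target(y, t, s2), space_score(y, t, s2))))
print("time target error:", abs(time_target(y, t, s2) - time_score(y, t, s2)))
```

In a learned model the closed-form functions would be replaced by automatic differentiation of an energy network, and the posterior averages by averages over training pairs; the Gaussian case only serves to make each identity checkable by hand.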

Paper Structure

This paper contains 45 sections, 29 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Comparison of single and dual score matching on recovering the energy of a scale mixture of two Gaussians in $d=1000$ dimensions. Experimental details are provided in the appendix. Left: Radial slices of the log probability. The single score matching estimate (green dashed curve) fails to recover the true energy (blue solid curve), even after global normalization (green dotted curve), while dual score matching (red dashed curve) succeeds. Middle: Radial components of the scores. Single score matching learns an accurate score over the support of the data (blue bar plot) but not outside of it. Right: Energy landscape across space and time (noise level) for a mixture of two Gaussians in one dimension. The direct path between the modes at $t=0$ crosses a large energy barrier (green curve), which is alleviated on a path that is not restricted to $t=0$ (red curve).
  • Figure 2: Convergence of energy estimates. The data set is split into two halves (denoted A and B), and separate energy models are trained on $N$ samples drawn from each half. Each scatterplot compares the energy estimates of the two models at $t=0$, over all $2N$ training images. As $N$ increases, the energy estimates of the two models converge for all images.
  • Figure 3: Histogram of log probabilities of images in the ImageNet dataset. Color-coded arrows indicate values for the example images on the right, and the leftmost (brown) and rightmost (green) arrows indicate values for a uniform noise image in $[0,1]$ and a constant image of intensity $0.5$, respectively. The distribution is well fit by a Gumbel distribution (red line). Additional examples of images organized by probability are shown in the appendix.
  • Figure 4: Influence of image statistics on probability. Left: $\log p_\theta(a x + b)$ as a function of $a$ and $b$. Middle: Horizontal slice ($b = \frac{1}{2}$) of the left panel for the example images of Figure 3. Right: Log probability as a function of sparsity, measured as the participation ratio of wavelet coefficients.
  • Figure 5: Left: Two hypothetical examples illustrating how local effective dimensionality depends on the scale of the neighborhood. For both examples, the support of the density corresponds to the blue regions. In the left example, the dimensionality around the red point decreases with scale (from 2 down to 0), while the opposite is true for the right example. Right: Log probability and effective dimensionality as a function of noise level. Colored lines correspond to different example images $x$ (shown in the right panel of Figure 3), while the dashed black line shows the average over the ImageNet test set. The vertical gray line indicates the minimum noise level presented during training ($t = 10^{-9}$), and the horizontal gray line the ambient dimensionality of the dataset ($d = 4096$).
  • ...and 3 more figures
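
The caption of Figure 4 measures sparsity with the participation ratio of wavelet coefficients, but neither the exact formula nor the wavelet transform is given in this summary. A minimal sketch of one common definition, $(\sum_i c_i^2)^2 / \sum_i c_i^4$, is shown below as an assumption; the function name is hypothetical, and the coefficients are taken as a flat list computed elsewhere. Under this definition the ratio is close to 1 when a single coefficient dominates and equals the number of coefficients when all have the same magnitude.

```python
def participation_ratio(coeffs):
    """(sum c^2)^2 / sum c^4: roughly how many coefficients carry the signal energy.

    One common definition, assumed here; the paper may normalize differently.
    """
    squares = [c * c for c in coeffs]
    total = sum(squares)
    if total == 0:
        raise ValueError("participation ratio is undefined for an all-zero vector")
    return total * total / sum(s * s for s in squares)

print(participation_ratio([3.0, 0.0, 0.0, 0.0]))    # one active coefficient -> 1.0
print(participation_ratio([1.0, -1.0, 1.0, -1.0]))  # four equal magnitudes -> 4.0
```

Lower values therefore correspond to sparser representations, and the ratio is invariant to rescaling all coefficients by a constant.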