Feature-guided score diffusion for sampling conditional densities

Zahra Kadkhodaie, Stéphane Mallat, Eero P. Simoncelli

TL;DR

The algorithm generates high-quality, diverse samples from the conditioning class. It can also condition on feature vectors interpolated between those of the training set, demonstrating out-of-distribution generalization.

Abstract

Score diffusion methods can learn probability densities from samples. The score of the noise-corrupted density is estimated using a deep neural network, which is then used to iteratively transport a Gaussian white noise density to a target density. Variants for conditional densities have been developed, but correct estimation of the corresponding scores is difficult. We avoid these difficulties by introducing an algorithm that guides the diffusion with a projected score. The projection pushes the image feature vector towards the feature vector centroid of the target class. The projected score and the feature vectors are learned by the same network. Specifically, the image feature vector is defined as the spatial averages of the channel activations in selected layers of the network. Optimizing the projected score for denoising loss encourages the image feature vectors of each class to cluster around their centroid. It also leads to separation of the centroids. We show that these centroids provide a low-dimensional Euclidean embedding of the class-conditional densities. We demonstrate that the algorithm can generate high-quality and diverse samples from the conditioning class. Conditional generation can also be performed using feature vectors interpolated between those of the training set, demonstrating out-of-distribution generalization.
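The feature vector defined in the abstract (spatial averages of channel activations in selected layers) and the class centroids can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names, the `(channels, height, width)` layout, and the use of linear interpolation between centroids are all hypothetical choices made for the example.

```python
import numpy as np

def feature_vector(activations):
    """Concatenate the spatial average of every channel in each selected layer.

    `activations` is a list of arrays, one per selected layer, each shaped
    (channels, height, width). The layout is an assumption for this sketch.
    """
    return np.concatenate([a.mean(axis=(1, 2)) for a in activations])

def class_centroids(features, labels):
    """Average the feature vectors of each class.

    `features` has shape (n_images, dim); `labels` has shape (n_images,).
    Returns a dict mapping each class label to its centroid.
    """
    return {int(y): features[labels == y].mean(axis=0) for y in np.unique(labels)}

def interpolate(phi_a, phi_b, t):
    """Linear interpolation between two feature vectors (assumed scheme)."""
    return (1.0 - t) * phi_a + t * phi_b

# Toy usage: two "layers" with 2 and 3 channels give a 5-dimensional feature vector.
acts = [np.ones((2, 4, 4)), 2.0 * np.ones((3, 2, 2))]
phi = feature_vector(acts)

# Toy centroids from three 2-dimensional feature vectors in two classes.
feats = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
labels = np.array([0, 0, 1])
centroids = class_centroids(feats, labels)
midpoint = interpolate(centroids[0], centroids[1], 0.5)
```

In this sketch, `midpoint` plays the role of a conditioning vector that does not correspond to any training class, which is the kind of interpolated feature vector the abstract refers to.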

Paper Structure

This paper contains 24 sections, 14 equations, 13 figures, 4 algorithms.

Figures (13)

  • Figure 1: Illustration of feature-guided score diffusion. Left: Score diffusion of a mixture of densities computes trajectories (black) that map samples of a Gaussian white noise (blue disk) to samples of two complex conditional densities (orange or green). Right: The feature space $\phi(x)$ defines a Euclidean embedding in which each mixture component is well separated (orange/green ellipses). In the embedding space, mixture trajectories (black) are similar at high noise variance $\sigma^2$, and bifurcate, moving toward different components at lower noise levels (Biroli et al., 2024). In our method, feature trajectories (orange/green) are forced toward the feature centroids ($\phi_y$ or $\phi_{y'}$, on right) of the corresponding conditional density ($p_y$ or $p_{y'}$, on left). These feature trajectories are used to guide the trajectories of $x_\sigma$ in the signal space (orange/green, left) toward the corresponding conditional densities.
  • Figure 2: Feature-guided denoising results at two noise levels (left: $\sigma=1$, right: $\sigma=0.5$). The leftmost column of each panel shows noisy images, drawn from 4 classes. The top row (green boxes) shows example conditioning images from the same 4 classes. The columns under each show the corresponding denoising results. Diagonal entries (red boxes) indicate images denoised with correct conditioning (conditioning image from the same class as the noisy image), whereas off-diagonal entries are incorrectly conditioned. The rightmost column of each panel shows denoising results using the (unconditioned) mixture denoiser (orange boxes). At high noise levels, conditioning on the correct class improves results significantly compared to the mixture model. Conditioning on the wrong class degrades performance, introducing features from the conditioning class. At smaller noise levels, the feature-guided and mixture denoisers produce similar outputs, but the effect of incorrect conditioning is still visible.
  • Figure 3: Left: Improvement in peak signal-to-noise ratio (PSNR) at different noise levels of the conditional model (discs) relative to the unconditioned mixture model (stars), averaged over samples from all classes. Right: Comparison of the conditional model (discs) with a denoiser optimized for a single class $y_0$ (stars). Upper points correspond to denoising of images from class $y_0$, with correct conditioning. Lower points correspond to denoising of images from other classes, $y \neq y_0$, with incorrect conditioning.
  • Figure 4: Concentration of image feature vectors $\phi(x)$ within class, and separation of average vectors $\phi_y$ between classes. The top row shows results for the unconditioned mixture model and the bottom row shows results for the conditional model. Left column: distribution of squared Euclidean distances between pairs of class feature vectors $\phi_y$ and $\phi_{y'}$ (orange histogram) and distribution of the geometric mean of the variances of feature vectors of images from $p_{y}(x)$ and $p_{y'}(x)$ (gray histogram). Middle column: Image and class embedding correlations in different layers of the UNet architecture. The mixture model does not separate classes well, while the conditional model separates classes significantly, especially in the middle layer. Right column: scatter plot of components of $\phi(x_1)$ vs $\phi(x_2)$ in the middle layer for $x_1 \in y_1$ and $x_2 \in y_2$. Example training images from $y_1$ and $y_2$ are shown along the axes. The image embeddings in the conditional model are separated, while there is very little separation in the mixture model.
  • Figure 5: Verification of Euclidean embedding (\ref{Euclidean-Embedd}). Density distance (\ref{eq:density-distance}), which bounds the symmetrized KL divergence between the two conditional densities, is well-correlated with the squared Euclidean distance between the corresponding mean feature vectors in the embedding space. Image pairs on the left are drawn from the closest three class pairs (red points), and those on the right are drawn from the most distant (blue points).
  • ...and 8 more figures
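
The PSNR improvement reported in Figure 3 can be read against the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{peak}^2 / \mathrm{MSE})$. The sketch below computes it with NumPy; the function names, the assumption that pixel values lie in $[0, 1]$ (so `peak=1.0`), and the toy images are hypothetical and not taken from the paper.

```python
import numpy as np

def psnr(clean, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB, assuming pixel values in [0, peak]."""
    mse = np.mean((clean - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * np.log10(peak ** 2 / mse)

def psnr_improvement(clean, conditional_out, mixture_out, peak=1.0):
    """PSNR gain (dB) of a conditional denoiser over an unconditioned one."""
    return psnr(clean, conditional_out, peak) - psnr(clean, mixture_out, peak)

# Toy example: constant errors of 0.1 and 0.2 on an all-zero 8x8 image.
clean = np.zeros((8, 8))
conditional_out = clean + 0.1   # MSE = 0.01
mixture_out = clean + 0.2       # MSE = 0.04
gain = psnr_improvement(clean, conditional_out, mixture_out)
```

Halving the error amplitude divides the MSE by four, so `gain` here is $10 \log_{10} 4 \approx 6.02$ dB.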