Lifting Architectural Constraints of Injective Flows

Peter Sorrenson, Felix Draxler, Armand Rousselot, Sander Hummerich, Lea Zimmermann, Ullrich Köthe

TL;DR

This work lifts the architectural and computational constraints of injective flows with a new efficient estimator for the maximum likelihood loss that is compatible with free-form bottleneck architectures. Extensive experiments on toy, tabular and image data demonstrate the competitive performance of the resulting model.

Abstract

Normalizing Flows explicitly maximize a full-dimensional likelihood on the training data. However, real data is typically only supported on a lower-dimensional manifold, leading the model to expend significant compute on modeling noise. Injective Flows fix this by jointly learning a manifold and the distribution on it. So far, they have been limited by restrictive architectures and/or high computational cost. We lift both constraints with a new efficient estimator for the maximum likelihood loss that is compatible with free-form bottleneck architectures. We further show that naively learning both the data manifold and the distribution on it can lead to divergent solutions, and use this insight to motivate a stable maximum likelihood training objective. We perform extensive experiments on toy, tabular and image data, demonstrating the competitive performance of the resulting model.

Paper Structure

This paper contains 54 sections, 5 theorems, 93 equations, 13 figures, 11 tables.

Key Result

Lemma C.1

Suppose the matrix $A$ depends on a variable $x$. Then we have the following expression for the derivative of the projection operator $A^\dagger A$:
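
The displayed formula is not included in this listing. As a hedged illustration only, a standard identity for a real matrix $A$ of locally constant rank is $\mathrm{d}(A^\dagger A) = A^\dagger\,\mathrm{d}A\,(I - A^\dagger A) + \big(A^\dagger\,\mathrm{d}A\,(I - A^\dagger A)\big)^\top$; whether the lemma is stated in exactly this form is an assumption. The NumPy sketch below checks this identity against a central finite difference. The helper names, the matrix shape and the step size are illustrative choices, not taken from the paper.

```python
import numpy as np


def projector(A):
    """Orthogonal projector onto the row space of A, i.e. pinv(A) @ A."""
    return np.linalg.pinv(A) @ A


def projector_derivative(A, dA):
    """Directional derivative of pinv(A) @ A along dA.

    Assumes the rank of A does not change in a neighbourhood of A.
    """
    A_pinv = np.linalg.pinv(A)
    n = A.shape[1]
    half = A_pinv @ dA @ (np.eye(n) - A_pinv @ A)
    # The derivative of a symmetric projector is symmetric.
    return half + half.T


rng = np.random.default_rng(0)
A0 = rng.normal(size=(2, 5))  # hypothetical wide matrix with full row rank
dA = rng.normal(size=(2, 5))  # hypothetical perturbation direction dA/dx

eps = 1e-6
finite_diff = (projector(A0 + eps * dA) - projector(A0 - eps * dA)) / (2 * eps)
analytic = projector_derivative(A0, dA)

max_abs_error = np.max(np.abs(finite_diff - analytic))
print("max abs deviation:", max_abs_error)
```

The check is only meaningful where the rank of $A$ is locally constant, since the pseudoinverse is not continuous at rank changes.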

Figures (13)

  • Figure 1: Free-form injective flow (FIF) training and inference. (Left) We combine a reconstruction loss ${\mathcal{L}}_\textrm{recon.}$ with a novel maximum likelihood loss $\tilde{\mathcal{L}}_\textrm{NLL}$ to obtain an injective flow without architectural constraints. (Right) We generate novel samples by decoding standard normal latent samples with our best-performing models on CelebA and MNIST. The reconstructions shown are on CelebA validation data; the samples are uncurated.
  • Figure 2: Naive training of autoencoders with negative log-likelihood (NLL, see \ref{sec:nll-problems}) leads to pathological solutions (left). Starting with the initialization ($t=0$, black), gradient steps increase the curvature of the learnt manifold ($t=1, 2$, orange). This reduces NLL because moving the projected points closer to one another reduces their entropy. This effect is stronger than the reconstruction loss. We fix this problem by evaluating the volume change off-manifold (right). This moves the manifold closer to the data and reduces the curvature ($t=1, 2$, green), until it eventually centers the manifold on the data with zero curvature ($t = \infty$, green). Light lines show the set of points which map to the same latent point. Data is projected onto the $t = 2$ manifold.
  • Figure 3: Learning a noisy 2-D sinusoid with a 1-D latent space for different reconstruction weights $\beta$. Color codes denote the value of the latent variable at each location. When the reconstruction term has low weight (left), the autoencoder learns to throw away information about the position along the sinusoid and only retains the orthogonal noise. Only sufficiently high weights (right) result in the desired solution, where the decoder spans the sinusoid manifold. The middle plot shows the tradeoff between reconstruction error and NLL as we transition between these regimes (box plots indicate variability across runs).
  • Figure 4: Representation of ill-defined probability density $\tilde{p}(x) \propto p(\hat{x}) e^{-\beta \lVert \hat{x} - x \rVert^2}$ (left and center). Solid black lines denote the manifold, dashed lines are a constant distance from the manifold. The probability density is constant along the manifold. The width of the cyan bands is proportional to $e^{-\beta \lVert \hat{x} - x \rVert^2}$ and represents the probability density along the on- and off-manifold contours. While the density is reasonable for a flat manifold (left), note that the amount of probability mass associated with a region of the manifold (bounded by solid lines) is larger at some points off the manifold than on it when the manifold has curvature (center). This behavior can lead to divergent solutions when optimizing for likelihood and should be compensated for. The appropriate compensation factor is the ratio of the volume of a small region on the manifold (small blue square embedded in green manifold, right) to the equivalent region off the manifold (large blue square, right). The blue arrows represent an orthonormal frame on the manifold, and the equivalent frame in the off-manifold region.
  • Figure 5: Plot of $f(\lambda_i) = \log \lambda_i - \lambda_i/\sigma^2$ with $\sigma = 1$, showing the maximum at $\lambda_i = \sigma^2$ and divergence to $-\infty$ on either side.
  • ...and 8 more figures
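
The location of the maximum described in the Figure 5 caption follows from elementary calculus. The short derivation below is a worked check of the caption's claim, using only the function defined there and assuming $\lambda_i > 0$:

```latex
\begin{aligned}
f(\lambda_i) &= \log \lambda_i - \frac{\lambda_i}{\sigma^2}, \qquad \lambda_i > 0,\\
f'(\lambda_i) &= \frac{1}{\lambda_i} - \frac{1}{\sigma^2} = 0
  \;\Longrightarrow\; \lambda_i = \sigma^2,\\
f''(\lambda_i) &= -\frac{1}{\lambda_i^2} < 0,\\
f(\sigma^2) &= \log \sigma^2 - 1 \quad (= -1 \text{ for } \sigma = 1),\\
\lim_{\lambda_i \to 0^+} f(\lambda_i) &= \lim_{\lambda_i \to \infty} f(\lambda_i) = -\infty.
\end{aligned}
```

Because $f$ is strictly concave, $\lambda_i = \sigma^2$ is its unique global maximum, and $f$ is unbounded below in both directions.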

Theorems & Definitions (10)

  • Lemma C.1
  • proof
  • Lemma C.2
  • proof
  • Lemma C.3
  • proof
  • Theorem C.4
  • proof
  • Theorem D.1
  • proof