Uncertainty in latent representations of variational autoencoders optimized for visual tasks

Josefina Catoni, Domonkos Martos, Ferenc Csikor, Enzo Ferrante, Diego H. Milone, Balázs Meszéna, Gergő Orbán, Rodrigo Echeveste

TL;DR

Drawing inspiration from classical computer vision, the authors introduce an inductive bias into the VAE by incorporating a global explaining-away latent variable. This remedies defective inference in VAEs and establishes EA-VAEs as reliable tools for inference under deep generative models with appropriate estimates of uncertainty.

Abstract

Deep Generative Models (DGMs) can learn flexible latent variable representations of images while avoiding the intractable computations common in Bayesian inference. However, investigating the properties of inference in Variational Autoencoders (VAEs), a major class of DGMs, reveals severe problems in their uncertainty representations. Here we draw inspiration from classical computer vision to introduce an inductive bias into the VAE by incorporating a global explaining-away latent variable, which remedies defective inference in VAEs. Unlike standard VAEs, the Explaining-Away VAE (EA-VAE) provides uncertainty estimates that align with normative requirements across a wide spectrum of perceptual tasks, including image corruption, interpolation, and out-of-distribution detection. We find that restored inference capabilities are delivered by developing a motif in the inference network (the encoder) which is widespread in biological neural networks: divisive normalization. Our results establish EA-VAEs as reliable tools to perform inference under deep generative models with appropriate estimates of uncertainty.
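
The role of the global scaling variable can be illustrated with a small sketch. It assumes a linear decoder with weights `decoder_weights`, which is a simplification chosen only for illustration (the paper's decoder is a deep network); the function name `generate_patch` and all parameter names are hypothetical. The only property taken from the source is that a single scalar $s$ acts multiplicatively on the latents $\mathbf{z}$.

```python
import numpy as np

def generate_patch(z, s, decoder_weights, noise_std=0.0, rng=None):
    """Generate an image vector from latents z and a global scale s.

    Assumption: a linear decoder x = W @ (s * z) + noise. This is an
    illustrative stand-in, not the architecture used in the paper.
    """
    scaled_latents = s * np.asarray(z, dtype=float)  # s multiplies every latent
    x = decoder_weights @ scaled_latents
    if noise_std > 0:
        rng = np.random.default_rng() if rng is None else rng
        x = x + rng.normal(0.0, noise_std, size=x.shape)  # observation noise
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))   # hypothetical decoder weights (16 pixels, 4 latents)
z = rng.normal(size=4)         # one fixed latent pattern

low_contrast = generate_patch(z, 0.5, W)
high_contrast = generate_patch(z, 2.0, W)
```

Under this linear assumption, changing `s` rescales the generated patch without changing its pattern, so contrast can be accounted for by `s` rather than by the latents `z`.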

Paper Structure (45 sections, 22 equations, 11 figures, 2 tables)

This paper contains 45 sections, 22 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Characteristic properties of inferred posteriors as contrast is varied. a, Expected behaviour of the posteriors of a pair of latent variables in natural images when contrast is systematically increased. Three example images (green, purple and mustard frames) are presented at varying contrast levels. As contrast increases, the inferred posteriors for each image (colored solid lines) progressively deviate both from the prior (dashed line) and from one another. There are two different sources of variance in latent representations. The signal variance refers to how the posterior mean changes with the input. The noise variance (or posterior variance) refers to the remaining uncertainty after having observed a given stimulus. Similarly, we call the average distance between the prior’s mean and the posteriors’ means the signal mean. The noise mean will be zero in our models, as is usually assumed for VAEs, without loss of generality. b, Cartoon of expected behaviour of signal mean, signal variance and noise variance as a function of contrast showing qualitative trends of these quantities. c,d, Signal mean, signal variance and noise variance of the inferred latent posteriors in the natural image-trained VAE (c) and EA-VAE (d), as a function of image contrast (Methods). Prior mean and variance are shown as dotted black lines. e, Comparison of the inference and reconstruction process in the VAE (left) and EA-VAE (right). In the VAE a single pool of latent variables $\mathbf{z}$ is inferred, while in the EA-VAE there is an additional global latent variable $s$ which acts multiplicatively on the latents, $\mathbf{z}$. f, Example test image patches and their respective reconstructions through the VAE and EA-VAE. g, Receptive filters of example latents in the VAE and EA-VAE. h, Inferred posterior mean of the scaling variable for individual patches (dots) in the EA-VAE model, as a function of the measured contrast of these images.
  • Figure 2: Characterization of posterior width for systematic image manipulations that affect inference uncertainty. a, Latent posterior width (mustard dots, with color matching that of the frame of average images) for uninformative images (top, mustard framed images) in the MNIST (left) and ChestMNIST (right) data sets. The distribution of noise posterior widths (std.) for in-distribution test images for standard VAE and EA-VAE are presented as violin plots. Prior uncertainty is shown as a reference (dashed line). b, Top left: Average posterior for images with the same label ('0' in green and '1' in orange) in an MNIST-trained model. Top right: Gradual morphing between representative image samples from digit '0' and digit '1'. Illustration of the expected changes of the posterior width is shown above. Bottom: Posterior widths for the morphed digits with standard VAE (left) and EA-VAE (right). c, Posterior width and the magnitude of the scaling variable of ChestMNIST-trained standard VAE and EA-VAE models for images increasingly corrupted either by Gaussian blurring (top) or additive pixel noise (bottom). d, Posterior width and the magnitude of the scaling variable for in-distribution (ID) and out-of-distribution (OOD) examples when trained on MNIST both for standard VAE (blue) and EA-VAE (red). Top: Examples of images from the same domain (gray frames) and different domains (mustard frames) as inputs, and reconstructions by the standard VAE (second row) and EA-VAE (third row) respectively. Bottom: uncertainty quantified by the posterior width and mean of scaling variable posterior for individual within-distribution or out-of-distribution images. As a reference, a dashed black line indicates the prior uncertainty.
  • Figure 3: Characterization of learned representations for different training data sets. Beyond natural image patches, contrast-augmented versions of the MNIST and Fashion MNIST databases are shown. a, Latent posterior signal mean as a function of the measured contrast of a large set of test images for both the VAE and the EA-VAE models. b, Inferred posterior mean of the scaling variable for individual images (dots) in the EA-VAE model, as a function of the measured contrast of these images. Average mean shown as solid line. c, Direct comparison of the learned representation of a cMNIST-trained standard VAE (blue) and an EA-VAE (red), both featuring a three-dimensional latent space. Top panels: Cross-sections of the latent space (see labels above panels). Dots show the contrast of images generated from individual states of the latent space, with dot size proportional to the contrast of the generated image. Note that cross sections for the EA-VAE correspond to different levels of the scaling variable. Bottom panel: Contrast of images generated from the three-dimensional latent space of the standard VAE model at a fixed distance from the origin. The fixed distance was 2 SD of the prior distribution.
  • Figure 4: Characterization of an MLP trained to classify MNIST handwritten digits from the latent posterior representations of the VAE and EA-VAE. a, Behaviour of the trained MLP when gradually morphing between image samples from digit '2' to digit '5'. Top panels: Histograms of output predicted probabilities averaged for different samples of digits '2' and '5' when trained with standard VAE (left) and EA-VAE (right) latent representations. Bottom panel: Entropy of the predicted probability distribution as a function of the combination weight of labels. b, Behaviour of the trained MLP when testing with out-of-distribution ChestMNIST (top) and pixel-shuffled MNIST (bottom) images. Left columns: Histograms show the distribution of predicted probabilities of labels of an example image, averaged for different samples of the posterior distribution of that image. The entropy of that averaged distribution is computed for both models. Right column: Entropy of the predicted probability distribution for the MLP trained with standard VAE representations vs EA-VAE representations. Each yellow dot represents a single x-ray image. The black dot marks the mean value. The identity line is shown as a dark grey dashed line. Shading corresponds to the upper limit determined by the uniform distribution.
  • Figure 5: Divisive normalization implemented by the recognition model. a, Conditional distribution ($p(L_1\,|\, L_2,x)$) of linear filter responses of a pair of example latent variables over natural images. Intensity of histogram is proportional to the probability. Linear filter responses are calculated as the dot product between an image and the receptive field of the neuron (Methods). Vertical and horizontal axes correspond to $L_1$ and $L_2$ response intensities. Light and dark colored bars correspond to two central and two flanking quadrants of $L_2$ activation, respectively. b, Non-linear dependence of latent posteriors. Dependence is characterized for 100 randomly chosen pairs of latents using natural image posteriors. Standard deviations of the conditional distribution of posterior means are shown close to zero activation of the conditioned latent variable (two central quadrants, marked with matching color as in panel a) and at high-intensity activation (two flanking quadrants, colored as in panel a). c, Distribution of learned weights of the divisive normalization model for the standard VAE (blue) and EA-VAE (red) models. Dark line shows the average of five fits, individual fits are shown with light lines. d, Evaluation of divisive normalization through contrasting linear responses with posterior means in the standard VAE (blue) and EA-VAE (red) models. Dots represent responses of individual latents for a particular natural image, colors distinguish responses to individual natural images, color lightness set according to image contrast. Lines show linear fits to the responses. Slope of the fit is designated as the normalization index, such that a smaller index corresponds to a decreased dependence of the posterior on the linear response, i.e. stronger divisive normalization. Normalization indices and contrasts of the five example images are shown in the legend. e, Normalization index in the standard VAE (blue) and EA-VAE (red) models as a function of the normalization factor of the divisive normalization model, i.e. the Euclidean mean posterior means. Dots correspond to individual images, light colors correspond to images with contrast levels below observation noise (Methods). f, Normalization index in the standard VAE model (blue) and EA-VAE model (red) as a function of image contrast. Dots represent the normalization index for a particular natural image. The vertical line signals the set observation noise.
  • ...and 6 more figures
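
The three quantities defined in the Figure 1 caption (signal mean, signal variance and noise variance) can be made concrete with a short sketch. The exact estimators are given in the paper's Methods, which are not reproduced here, so the choices below are assumptions: distance to the prior mean is taken as an absolute difference, and all quantities are averaged over latents. The function name `posterior_statistics` and its arguments are hypothetical.

```python
import numpy as np

def posterior_statistics(post_means, post_vars, prior_mean=0.0):
    """Summarize Gaussian posteriors inferred for a set of images.

    post_means, post_vars: arrays of shape (n_images, n_latents) holding
    the posterior mean and variance of each latent for each image.
    The estimators below are illustrative assumptions, not the paper's.
    """
    post_means = np.asarray(post_means, dtype=float)
    post_vars = np.asarray(post_vars, dtype=float)

    # Signal mean: average distance between posterior means and the prior mean.
    signal_mean = np.mean(np.abs(post_means - prior_mean))
    # Signal variance: how much the posterior mean varies across inputs.
    signal_variance = np.mean(np.var(post_means, axis=0))
    # Noise variance: uncertainty remaining after observing each input.
    noise_variance = np.mean(post_vars)
    return signal_mean, signal_variance, noise_variance

# Two images, two latents: posterior means move with the input,
# posterior variances stay at 0.25.
means = np.array([[1.0, -1.0], [-1.0, 1.0]])
variances = np.full((2, 2), 0.25)
signal_mean, signal_variance, noise_variance = posterior_statistics(means, variances)
```

Evaluating these statistics on sets of images grouped by contrast level would give curves of the kind the caption describes for panels c and d.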