Understanding Variational Autoencoders with Intrinsic Dimension and Information Imbalance

Charles Camboulin, Diego Doimo, Aldo Glielmo

TL;DR

VAEs undergo a transition in behaviour once the bottleneck size exceeds the Intrinsic Dimension (ID) of the data. The transition manifests as a double hunchback ID profile and a qualitative shift in information processing, as captured by the Information Imbalance (II).

Abstract

This work presents an analysis of the hidden representations of Variational Autoencoders (VAEs) using the Intrinsic Dimension (ID) and the Information Imbalance (II). We show that VAEs undergo a transition in behaviour once the bottleneck size is larger than the ID of the data, manifesting in a double hunchback ID profile and a qualitative shift in information processing as captured by the II. Our results also highlight two distinct training phases for architectures with sufficiently large bottleneck sizes, consisting of a rapid fit and a slower generalisation, as assessed by the differing behaviour of ID, II, and KL loss. These insights demonstrate that II and ID could be valuable tools for aiding architecture search and for diagnosing underfitting in VAEs; more broadly, they contribute to advancing a unified understanding of deep generative models through geometric analysis.
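The comparison between bottleneck size and the ID of the data can be made concrete with a small estimator. This page does not say which ID estimator the authors use, so the sketch below is an assumption: it implements the maximum-likelihood form of the TwoNN estimator, which uses only the ratio of each point's second and first nearest-neighbour distances. The function name `two_nn_id` and the toy data are hypothetical.

```python
import numpy as np


def two_nn_id(x):
    """Estimate the intrinsic dimension of the rows of x (shape: n_points, n_features).

    Assumption: maximum-likelihood TwoNN estimate, d = N / sum_i log(r2_i / r1_i),
    where r1_i and r2_i are the distances from point i to its first and second
    nearest neighbours. Builds a dense n x n distance matrix, so it is only
    meant for small samples without duplicate points.
    """
    x = np.asarray(x, dtype=float)
    # Pairwise Euclidean distances via explicit differences (numerically safe).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    np.fill_diagonal(dist, np.inf)  # exclude each point from its own neighbours

    # After partitioning at index 1, columns 0 and 1 hold the two smallest distances.
    two_smallest = np.partition(dist, 1, axis=1)[:, :2]
    mu = two_smallest[:, 1] / two_smallest[:, 0]
    return len(mu) / np.sum(np.log(mu))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: a 2-dimensional sheet embedded in 5 dimensions by zero padding.
    sheet = np.hstack([rng.uniform(size=(1000, 2)), np.zeros((1000, 3))])
    print("estimated ID:", two_nn_id(sheet))
```

Applied to the activations of each layer in turn, an estimator of this kind would yield an ID-versus-layer profile of the sort the abstract describes, although the exact values depend on the estimator and the sample size.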

Paper Structure

This paper contains 14 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: ID and II of trained architectures. Left) IDs for different bottleneck sizes $K$. Centre) IIs $\Delta(l \rightarrow l_{0})$ from layer ($l$) to the input ($l_0$) for different bottleneck sizes $K$. Right) Relative II difference between consecutive layers ($l$ and $l+1$) as a function of layer index $l$. In all panels, the quantities are graphed as a function of the layer index ($l$).
  • Figure 2: Losses and II during training. Left) FID loss on test set. Centre) KL loss on test set. Right) II from last layer ($l_{10}$) to the input ($l_0$). In all panels, the quantities are graphed as a function of training epoch and for architectures of increasing bottleneck sizes (denoted by different colours).
  • Figure 3: ID during training. ID as a function of layer index for different epoch numbers for $K=64$.
  • Figure A1: II and FID, and IIs between different architectures. Left) II from the output to the input (left axis) and FID test loss (right axis) for increasing bottleneck sizes ('latent dimensions'). The panel clearly shows a transition in the nature of the output layer of VAEs as the bottleneck surpasses a critical value. This transition is not paralleled by an increase in the test error as measured by the FID loss. Centre and Right) II curves at layers $l_{5}$ (bottleneck) and $l_{10}$ (output), measuring the imbalances across different architectures, $\Delta(K \rightarrow 2K)$ and $\Delta(2K \rightarrow K)$, where $K$ is the size of the bottleneck. The two panels show that a transition is present in the nature of the bottleneck layer (layer 5) and output layer (layer 10) when the bottleneck size of VAEs surpasses a value of $K=16$.
  • Figure A2: II curves during training. II from every layer ($l$) to the input ($l_0$) as a function of the layer index for increasing epochs during training and for architectures with different bottleneck sizes (2, 8 and 128, here referred to as 'Dim.'). The figures show that before training (dashed black curve) the II is zero up to the bottleneck and one afterwards, indicating full information in the encoder and no information in the decoder. The figures also illustrate the difference in II before and after the transition at $K=16$.
  • ...and 2 more figures
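
The captions above report the II $\Delta(l \rightarrow l_{0})$ from a layer to the input, with values near zero read as full information and values near one as no information. As a minimal sketch, assuming the standard rank-based definition $\Delta(A \rightarrow B) = \frac{2}{N} \langle r^{B} \mid r^{A} = 1 \rangle$ with Euclidean distances, this can be computed as follows. The function name `information_imbalance` and the random arrays standing in for layer activations are hypothetical; nothing here reproduces the authors' pipeline.

```python
import numpy as np


def _pairwise_distances(x):
    """Dense Euclidean distance matrix with the diagonal set to infinity."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is never its own neighbour
    return dist


def information_imbalance(a, b):
    """Information Imbalance from space A to space B.

    a, b: arrays of shape (n_points, n_features_a) and (n_points, n_features_b)
    describing the same n_points samples in two representations.
    Returns (2 / N) times the mean rank, in B, of each point's nearest
    neighbour in A: close to 0 if A predicts B, close to 1 if it does not.
    """
    n = a.shape[0]
    dist_a = _pairwise_distances(a)
    dist_b = _pairwise_distances(b)

    nearest_in_a = np.argmin(dist_a, axis=1)
    # Double argsort turns distances into 0-based ranks; +1 makes the
    # nearest neighbour rank 1. The infinite diagonal ranks last.
    ranks_b = np.argsort(np.argsort(dist_b, axis=1, kind="stable"), axis=1) + 1
    conditional_ranks = ranks_b[np.arange(n), nearest_in_a]
    return 2.0 / n * conditional_ranks.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inputs = rng.normal(size=(300, 4))   # stand-in for flattened inputs (l_0)
    layer = inputs[:, :2]                # stand-in for a layer that keeps 2 features
    print("layer -> input:", information_imbalance(layer, inputs))
    print("input -> layer:", information_imbalance(inputs, layer))
```

The measure is asymmetric by design, which is why the captions distinguish $\Delta(K \rightarrow 2K)$ from $\Delta(2K \rightarrow K)$.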