Structure by Architecture: Structured Representations without Regularization

Felix Leeb, Giulia Lanzillotta, Yashas Annadani, Michel Besserve, Stefan Bauer, Bernhard Schölkopf

TL;DR

This work designs a novel autoencoder architecture capable of learning a structured representation without the need for aggressive regularization, and demonstrates how these models learn a representation that improves results in a variety of downstream tasks including generation, disentanglement, and extrapolation.

Abstract

We study the problem of self-supervised structured representation learning using autoencoders for downstream tasks such as generative modeling. Unlike most methods which rely on matching an arbitrary, relatively unstructured, prior distribution for sampling, we propose a sampling technique that relies solely on the independence of latent variables, thereby avoiding the trade-off between reconstruction quality and generative performance typically observed in VAEs. We design a novel autoencoder architecture capable of learning a structured representation without the need for aggressive regularization. Our structural decoders learn a hierarchy of latent variables, thereby ordering the information without any additional regularization or supervision. We demonstrate how these models learn a representation that improves results in a variety of downstream tasks including generation, disentanglement, and extrapolation using several challenging and natural image datasets.
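The sampling idea in the abstract, which relies only on the independence of the latent variables rather than on a fixed prior, can be illustrated with a minimal sketch. It assumes that new latent vectors are formed by drawing each latent coordinate independently from the values observed when encoding real samples; the function name `hybrid_sample` and the list-of-lists data layout are hypothetical and chosen only for illustration, not taken from the authors' code.

```python
import random


def hybrid_sample(latents, n_samples, rng=None):
    """Build new latent vectors by resampling each coordinate independently.

    latents:   list of encoded latent vectors, each a list of d floats
               (assumed to come from an already trained encoder).
    n_samples: number of new latent vectors to draw.
    rng:       optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    d = len(latents[0])
    samples = []
    for _ in range(n_samples):
        # Coordinate j is copied from a randomly chosen encoded example,
        # chosen separately for every j. This matches each marginal of the
        # encoded data and treats the coordinates as independent.
        samples.append([rng.choice(latents)[j] for j in range(d)])
    return samples


# Toy "encoded" latents with d = 2 (illustrative values only).
encoded = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]]
new_latents = hybrid_sample(encoded, n_samples=5, rng=random.Random(0))
```

Each new vector would then be passed through the decoder to generate a sample. Because the coordinates are mixed across different encoded examples, the result is only a plausible latent code to the extent that the latent variables really are independent.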

Paper Structure

This paper contains 26 sections, 2 equations, 25 figures, 4 tables.

Figures (25)

  • Figure 1: The structural decoder reconstructs (or generates) a sample from a latent vector $U$ by first splitting $U$ into $d$ variables, each of which infuses latent information through an affine transformation of the pixel $v_l^{hw}$ in the image feature map $S_i$, produced by a Str-Tfm layer (green box), where $\alpha_i$ and $\beta_i$ are the affine parameters extracted from the latent variable $U_i$ by the network $\mathrm{MLP}_i$.
  • Figure 2: Reconstruction quality for all models and datasets (lower is better). "Baseline" models correspond to traditional "hourglass" CNN architectures, while the "Structural" models use our novel architectures to further structure the learned representation.
  • Figure 3: Quality of the generated samples using different models and sampling methods (lower is better). Note that our SAE models perform well without having to regularize the latent space towards a prior. In fact, even with the conventional "hourglass" architecture (in orange), the hybrid sampling method generates relatively high-quality samples, often outperforming the more principled prior-based sampling.
  • Figure 4: Latent traversals of several models trained on 3D-Shapes, in their original order. Note the ordering of the information in the structural decoder models (SAE-12 and SAE-3), where higher-level, nonlinear features (like shape and orientation) are encoded in the first few dimensions, which feed into Str-Tfm layers deeper in the network. Somewhat surprisingly, the SAE-12 even learns to compact the representation by ignoring superfluous latent variables (e.g. first row), resembling the effect of posterior collapse in VAEs (see \ref{sec:indep}).
  • Figure 5: Disentanglement scores for 3D-Shapes. DCI denotes the DCI disentanglement score [eastwood2018framework], MIG is the Mutual Information Gap [chen2018isolating], IRS is the Interventional Robustness Score [suter2018interventional], and Mod/Exp refers to the Modularity and Explicitness scores, respectively [ridgeway2018learning] (for all these metrics, higher is better). The figure on the right shows how the scores vary across five models with different random seeds, each marked with a cross (lines indicate the resulting mean and standard deviation). Both hierarchical methods, SAE-12 and VLAE-12, outperform all other baselines, and the SAE in particular performs well despite the lack of regularization.
  • ...and 20 more figures
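
The Str-Tfm step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a single linear map stands in for $\mathrm{MLP}_i$, $\alpha_i$ and $\beta_i$ are assumed to hold one value per channel, and any normalization or convolution surrounding the layer is omitted. The names `split_latent`, `linear`, and `str_tfm` are hypothetical.

```python
def split_latent(u, d):
    """Split latent vector u into d equally sized chunks U_1, ..., U_d."""
    size = len(u) // d
    return [u[k * size:(k + 1) * size] for k in range(d)]


def linear(x, weights, bias):
    """Single linear layer; a stand-in (assumption) for MLP_i."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def str_tfm(feature_map, u_i, w_alpha, b_alpha, w_beta, b_beta):
    """Apply alpha_i * v + beta_i to every pixel v of the feature map.

    feature_map: nested list indexed as [channel][height][width].
    u_i:         the latent chunk U_i feeding this layer.
    alpha_i and beta_i are computed from u_i, one value per channel
    (the per-channel layout is an assumption of this sketch).
    """
    alpha = linear(u_i, w_alpha, b_alpha)
    beta = linear(u_i, w_beta, b_beta)
    return [[[alpha[c] * v + beta[c] for v in row] for row in channel]
            for c, channel in enumerate(feature_map)]


# Toy example: 2 channels, 1x2 spatial size, 1-dimensional latent chunk.
chunks = split_latent([2.0, -1.0, 0.5], d=3)
s_i = [[[1.0, 2.0]], [[3.0, 4.0]]]
out = str_tfm(s_i, chunks[0],
              w_alpha=[[1.0], [0.5]], b_alpha=[0.0, 0.0],
              w_beta=[[0.0], [1.0]], b_beta=[1.0, 0.0])
```

In this toy setup the latent chunk only rescales and shifts each channel of the feature map, which is the sense in which each $U_i$ "infuses" information at one depth of the decoder.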