Disentangling Factors of Variation via Generative Entangling

Guillaume Desjardins, Aaron Courville, Yoshua Bengio

TL;DR

The paper introduces a higher-order spike-and-slab restricted Boltzmann machine (ssRBM) that uses four-way multiplicative interactions among groups of latent variables to model how factors of variation are entangled in data, so that inference disentangles them. The model is trained entirely by unsupervised maximum likelihood. Through a block-wise, multi-way pooling scheme and variational mean-field inference, it learns to separate factors such as emotion and identity without labels. Experiments on synthetic data and the Toronto Face Dataset show that the approach can produce interpretable, disentangled representations and improve emotion-recognition performance relative to non-disentangled baselines, with results competitive with supervised and other unsupervised methods. The work points toward deep, layered disentangling by stacking such blocks, maintaining local coherence while progressively uncovering higher-level, nonlocal factors.

Abstract

Here we propose a novel model family with the objective of learning to disentangle the factors of variation in data. Our approach is based on the spike-and-slab restricted Boltzmann machine, which we generalize to include higher-order interactions among multiple latent variables. Seen from a generative perspective, the multiplicative interactions emulate the entangling of factors of variation. Inference in the model can be seen as disentangling these generative factors. Unlike previous attempts at disentangling latent factors, the proposed model is trained using no supervised information regarding the latent factors. We apply our model to the task of facial expression classification.

Paper Structure

This paper contains 13 sections, 10 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Energy function of our higher-order spike & slab RBM (ssRBM), used to disentangle (multiplicative) factors of variation in the data. Two groups of latent spike variables, $g$ and $h$, interact to explain the data $v$, through the weight tensor $W$. While the ssRBM instantiates a slab variable $s_j$ for each hidden unit $h_j$, our higher-order model employs a slab $s_{ij}$ for each pair of spike variables ($g_i$, $h_j$). $\mu_{ij}$ and $\alpha_{ij}$ are respectively the mean and precision parameters of $s_{ij}$. An additional set of spike variables $f$ is used to gate groups of latent variables $h$, $g$ and serves to promote group sparsity. Most parameters are thus indexed by an extra subscript $k$. Finally, $e$, $c$ and $d$ are standard bias terms for variables $f$, $g$ and $h$, while $\Lambda$ is a diagonal precision matrix on the visible vector.
  • Figure 2: Block-sparse connectivity pattern with dense interactions between $g$ and $h$ within each block (only shown for $k$-th block). Each block is gated by a separate $f_k$ variable.
  • Figure 3: (top) Samples from our synthetic dataset (before noise). In each image, a figure "X" can appear at five different positions, in one of eight basic colors. Objects in a given image must all be of the same color. (bottom) Filters learnt by a bilinear ssRBM with $M=3$, $N=5$, which successfully show disentangling of color information (rows) from position (columns).
  • Figure 4: Example blocks obtained with $K=100$, $M=N=5$. The filters (inner-most dimension of tensor $W$) in each block exhibit global cohesion, specializing themselves to a subset of identities and emotions: {happiness, fear, neutral} in (left) and {happiness, anger} in (right). In both cases, $g$-units (which pool over columns) encode emotions, while $h$-units (which pool over rows) are more closely tied to identity.
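
The gated multiplicative interaction described in the captions of Figures 1 and 2 can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that, given the latent variables, the visible vector is Gaussian with mean $\Lambda^{-1} \sum_{i,j,k} W_{\cdot ijk}\, s_{ijk}\, g_{ik}\, h_{jk}\, f_k$, which is an assumption based on the caption's description rather than a formula stated on this page. The function name `visible_mean`, the array shapes, and the toy sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def visible_mean(W, s, g, h, f, lam):
    """Assumed conditional mean of v given all latent variables.

    Shapes (illustrative, not from the paper):
      W   : (D, M, N, K)  weight tensor, one filter per (i, j) pair in block k
      s   : (M, N, K)     slab variables, one per (g_i, h_j) pair in block k
      g   : (M, K)        binary spike variables of the first group
      h   : (N, K)        binary spike variables of the second group
      f   : (K,)          binary gates, one per block
      lam : (D,)          diagonal of the precision matrix Lambda
    """
    # A filter contributes only if g_ik, h_jk and the block gate f_k are all on.
    drive = np.einsum("dijk,ijk,ik,jk,k->d", W, s, g, h, f)
    return drive / lam

rng = np.random.default_rng(0)
D, M, N, K = 6, 3, 5, 2          # toy sizes, chosen arbitrarily
W = rng.normal(size=(D, M, N, K))
s = rng.normal(size=(M, N, K))
lam = np.ones(D)

# Switch on a single (g, h) pair inside block 0 and open that block's gate.
g = np.zeros((M, K))
h = np.zeros((N, K))
f = np.zeros(K)
g[1, 0] = 1.0
h[4, 0] = 1.0
f[0] = 1.0
mean_on = visible_mean(W, s, g, h, f, lam)

# Closing every gate silences all blocks, whatever g and h are.
mean_off = visible_mean(W, s, g, h, np.zeros(K), lam)
```

With one active pair, `mean_on` reduces to the single filter `W[:, 1, 4, 0]` scaled by its slab value, while `mean_off` is the zero vector. This mirrors the captions' description of $f_k$ gating an entire block and of $g$ and $h$ jointly selecting filters within it.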