Early Visual Concept Learning with Unsupervised Deep Learning

Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, Alexander Lerchner

TL;DR

This work tackles the challenge of learning disentangled factors from raw images in an unsupervised way. By imposing neuroscience-inspired constraints—continuous data transformations, redundancy reduction, and statistical independence—within a variational autoencoder framework, it demonstrates reliable disentanglement of continuous generative factors and shows zero-shot generalization to unseen factor combinations. The approach yields emergent properties such as reasoning about objectness and robust transfer capabilities, even without supervision, and is validated across multiple synthetic and real-world datasets. The findings suggest that unsupervised pre-training with disentangled representations can enhance transfer, fast learning, and robust reasoning in downstream tasks.

Abstract

Automated discovery of early visual concepts from raw image data is a major open challenge in AI research. Addressing this problem, we propose an unsupervised approach for learning disentangled representations of the underlying factors of variation. We draw inspiration from neuroscience, and show how this can be achieved in an unsupervised generative model by applying the same learning pressures as have been suggested to act in the ventral visual stream in the brain. By enforcing redundancy reduction, encouraging statistical independence, and exposure to data with transform continuities analogous to those to which human infants are exposed, we obtain a variational autoencoder (VAE) framework capable of learning disentangled factors. Our approach makes few assumptions and works well across a wide variety of datasets. Furthermore, our solution has useful emergent properties, such as zero-shot inference and an intuitive understanding of "objectness".
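The learning pressures named in the abstract can be illustrated with a small numerical sketch. It assumes the standard VAE lower bound with the KL term scaled by a coefficient $\beta$ (the figure captions refer to $\beta=0$, $\beta=1$ and $\beta=4$), a diagonal Gaussian posterior, a unit Gaussian prior, and a Bernoulli pixel likelihood. The function names and the Bernoulli choice are illustrative assumptions, not the authors' code.

```python
import numpy as np


def gaussian_kl_to_unit_prior(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent units."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def bernoulli_reconstruction_nll(x, x_prob, eps=1e-7):
    """Negative log-likelihood of binary pixels x under decoder probabilities.

    The Bernoulli likelihood is an assumption made for this sketch.
    """
    x_prob = np.clip(x_prob, eps, 1.0 - eps)
    log_lik = x * np.log(x_prob) + (1.0 - x) * np.log(1.0 - x_prob)
    return -np.sum(log_lik, axis=-1)


def beta_vae_loss(x, x_prob, mu, log_var, beta=4.0):
    """Batch-averaged loss: reconstruction NLL plus beta times the KL term.

    beta = 0 leaves only the reconstruction term, beta = 1 gives the usual
    VAE bound, and beta > 1 puts extra pressure on the latent code to stay
    close to the factorised unit Gaussian prior.
    """
    recon = bernoulli_reconstruction_nll(x, x_prob)
    kl = gaussian_kl_to_unit_prior(mu, log_var)
    return np.mean(recon + beta * kl)


# Hypothetical toy inputs: one image with two pixels and ten latent units.
x = np.array([[1.0, 0.0]])
x_prob = np.array([[0.5, 0.5]])
mu = np.ones((1, 10))
log_var = np.zeros((1, 10))
loss_entangled = beta_vae_loss(x, x_prob, mu, log_var, beta=1.0)
loss_disentangled = beta_vae_loss(x, x_prob, mu, log_var, beta=4.0)
```

In this form, raising `beta` only changes the weight on the KL term, so any latent unit that does not help reconstruction is pushed back towards the uninformative prior.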

Paper Structure

This paper contains 25 sections, 3 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: A: Disentangled representations of data generative factors allow for fast knowledge transfer between different reinforcement learning (RL) policies. State-of-the-art RL models without such representations (e.g. DQN; Mnih et al., 2015) require complete re-learning of low-level features for different tasks (Lake et al., 2016). B: Models are unable to generalise to data outside of the convex hull of the training distribution (light blue line) unless they learn about the data generative factors and recombine them in novel ways. C: Sparse data points do not provide enough information for an unsupervised model to identify where the data manifold should lie. Data generated using factors densely sampled from continuous distributions makes manifold learning less ambiguous.
  • Figure 2: A: Disentangled representation learnt with $\beta=4$. Each column represents a latent $z_i$, ordered according to the learnt Gaussian variance (last row). Row 1 (position) shows the mean activation (red represents high values) of each latent $z_i$ as a function of all 32x32 locations averaged across objects, rotations and scales. Row 2 (scale) shows the mean activation of each unit $z_i$ as a function of scale (averaged across rotations and positions). Row 3 (rotation) shows the mean activation of each unit $z_i$ as a function of rotation (averaged across scales and positions). Square is red, oval is green and heart is blue. Rows 4-8 (second group) show reconstructions resulting from the traversal of each latent $z_i$ over three standard deviations around the unit Gaussian prior mean while keeping the remaining 9/10 latent units fixed to the values obtained by running inference on an image from the dataset. After learning, five latents learnt to represent the generative factors of the data, while the others converged to the uninformative unit Gaussian prior. B: Similar analysis for an entangled representation learnt with $\beta=0$.
  • Figure 3: A: Factor change classification accuracy for the original 2D shapes dataset (heart, oval and square). Ground truth uses data generating vectors $v$. PCA and ICA decompositions keep the first ten components (PCA components explain $60.8\%$ of variance). Untrained refers to a VAE with random weights. Disentangled is a VAE with $\beta=4$. Entangled uses either $\beta=0$ (maximum likelihood solution) or $\beta=1$ (Bayes solution). B: "Zero-shot Understanding" refers to a VAE that did not see particular combinations of the generative factors during training (see Sec. \ref{sec_zsl}), but had to reason about them during factor change classification. A projection of the hypercube formed by the data generative factors is visualised on the right. Only the yellow subset was used for training. The held out factor combinations are shown in grey and were used to evaluate the factor change classification accuracy.
  • Figure 4: A: Negative correlation between data transform continuity and the degree of disentangling achieved by VAEs. Abscissa is the average normalised Hamming distance between each pair of consecutive transforms of each object. Ordinate is factor change classification accuracy from Sec. \ref{sec_quant}. Disentangling performance is robust to Bernoulli noise added to the data at test time, as shown by slowly degrading classification accuracy up to 10% noise level, considering that the 2D objects occupy on average between 2% and 7% of the image depending on scale. Fluctuations in classification accuracy for similar Hamming distances are due to the different nature of subsampled generative factors (i.e. symmetries are present in rotation but are lacking in position). B: Positive correlation is present between the size of $\mathbf{z}$ and the optimal normalised values of $\beta$ for disentangled factor learning for a fixed VAE architecture. $\beta$ values are normalised by latent $\mathbf{z}$ size $m$ and input $\mathbf{x}$ size $n$. Note that $\beta$ values are not uniformly sampled. Good reconstructions are associated with entangled representations (lower disentanglement scores). Orange approximately corresponds to unnormalised $\beta=1$. Disentangled representations (high disentanglement scores) often result in blurry reconstructions.
  • Figure 5: A: Amoeba object with four arms of varying length. B: Two non-linear generative factors determine the lengths of the pairwise grouped arms. Traversal over three standard deviations around the unit Gaussian prior mean for two latent units ($z_1$ and $z_2$) that learnt disentangled representations of the two generative factors. $z_1$ learnt the sigmoidal factor. $z_2$ learnt the quadratic factor. $z_3$ learnt to be a switch that determines which half of the quadratic factor is traversed by $z_2$.
  • ...and 2 more figures
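
The latent traversals described in the captions of Figures 2 and 5 can be sketched as follows. This is a minimal illustration that reads "three standard deviations around the unit Gaussian prior mean" as the interval $[-3, 3]$; the function name `traverse_latent`, the step count, and the example latent vector are hypothetical, and the encoder and decoder that would produce and consume these vectors are not shown.

```python
import numpy as np


def traverse_latent(z_inferred, index, num_steps=7, span=3.0):
    """Build a batch of latent vectors that sweeps a single unit.

    z_inferred: 1-D latent vector obtained by running inference on one image.
    index:      position of the latent unit z_i to traverse.
    span:       half-width of the sweep in prior standard deviations
                (assumed here to be 3, i.e. values from -3 to +3).
    All other units keep their inferred values.
    """
    z_inferred = np.asarray(z_inferred, dtype=float)
    values = np.linspace(-span, span, num_steps)
    batch = np.tile(z_inferred, (num_steps, 1))  # shape (num_steps, latent_size)
    batch[:, index] = values
    return batch


# Hypothetical inferred code with ten latent units.
z = np.array([0.2, -1.1, 0.0, 0.7, 0.3, 0.0, -0.4, 0.1, 0.0, 0.5])
sweep = traverse_latent(z, index=2)
```

Passing each row of `sweep` through a trained decoder would give one row of reconstructions in such a traversal plot. A unit that has collapsed to the prior should leave the reconstructions essentially unchanged along its sweep.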