SCAN: Learning Hierarchical Compositional Visual Concepts

Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher P Burgess, Matko Bosnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, Alexander Lerchner

TL;DR

SCAN addresses the challenge of learning grounded, hierarchical visual concepts with minimal supervision by grounding a symbolic concept space in a disentangled visual primitive space learned by a β-VAE with a denoising autoencoder loss. It enables bidirectional image-symbol inference and introduces recombination operators (AND, IN COMMON, IGNORE), implemented through a conditional convolutional module, to traverse and expand the implicit concept hierarchy. The approach demonstrates strong performance on DeepMind Lab and CelebA, surpassing baselines in both accuracy and diversity, and shows the capability to imagine novel concepts beyond its training data. The work suggests broad applicability to reinforcement learning, planning, and robust concept-based perception, thanks to its sample efficiency and flexible symbol representations.

Abstract

The seemingly infinite diversity of the natural world arises from a relatively small set of coherent rules, such as the laws of physics or chemistry. We conjecture that these rules give rise to regularities that can be discovered through primarily unsupervised experiences and represented as abstract concepts. If such representations are compositional and hierarchical, they can be recombined into an exponentially large set of new concepts. This paper describes SCAN (Symbol-Concept Association Network), a new framework for learning such abstractions in the visual domain. SCAN learns concepts through fast symbol association, grounding them in disentangled visual primitives that are discovered in an unsupervised manner. Unlike state of the art multimodal generative model baselines, our approach requires very few pairings between symbols and images and makes no assumptions about the form of symbol representations. Once trained, SCAN is capable of multimodal bi-directional inference, generating a diverse set of image samples from symbolic descriptions and vice versa. It also allows for traversal and manipulation of the implicit hierarchy of visual concepts through symbolic instructions and learnt logical recombination operations. Such manipulations enable SCAN to break away from its training data distribution and imagine novel visual concepts through symbolically instructed recombination of previously learnt concepts.
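
As a rough illustration of the symbolically instructed recombination described in the abstract, the sketch below treats a concept as a partial assignment of values to visual primitives and implements AND, IN COMMON and IGNORE as plain dictionary operations. This is a toy under stated assumptions: the names `Concept`, `concept_and`, `in_common` and `ignore` are hypothetical, the factor keys and values are made up for the example, and SCAN itself learns these operators over latent distributions rather than applying hand-written rules.

```python
from typing import Dict, Optional

# A concept is assumed to be a partial assignment: primitive name -> value.
# Primitives that are absent are "abstracted away" (unspecified).
Concept = Dict[str, str]


def concept_and(a: Concept, b: Concept) -> Optional[Concept]:
    """Combine the constraints of both concepts (assumed set-union semantics).

    Returns None when the two concepts assign different values to the
    same primitive, since no single image could satisfy both.
    """
    for key in a.keys() & b.keys():
        if a[key] != b[key]:
            return None
    return {**a, **b}


def in_common(a: Concept, b: Concept) -> Concept:
    """Keep only the primitive/value pairs shared by both concepts."""
    return {key: value for key, value in a.items() if b.get(key) == value}


def ignore(a: Concept, b: Concept) -> Concept:
    """Drop from `a` every primitive that `b` specifies."""
    return {key: value for key, value in a.items() if key not in b}


white_suitcase = {"object": "suitcase", "object_colour": "white"}
blue_wall = {"wall_colour": "blue"}
white_hat = {"object": "hat", "object_colour": "white"}

print(concept_and(white_suitcase, blue_wall))
print(in_common(white_suitcase, white_hat))
print(ignore(white_suitcase, {"object": "suitcase"}))
```

In this toy, AND moves down the hierarchy to a more specific concept, while IN COMMON and IGNORE move up to more abstract ones.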

Paper Structure

This paper contains 36 sections, 7 equations, 20 figures, 1 table.

Figures (20)

  • Figure 1: Schematic of an implicit concept hierarchy built upon a subset of four visual primitives: object identity ($I$), object colour ($O$), floor colour ($F$) and wall colour ($W$) (other visual primitives necessary to generate the scene are ignored in this example). Concepts form an implicit hierarchy, where each parent is an abstraction over its children and over the original set of visual primitives (the values of the concept-defining sets of visual primitives are indicated by the bold capital letters). In order to generate an image that corresponds to a concept, one has to fill in values for the factors that got abstracted away (represented as "_"), e.g. by sampling from their respective priors. Given certain nodes in the concept hierarchy, one can traverse the other nodes using logical operations. See Sec. \ref{sec_formalising} for our formal definition of concepts.
  • Figure 2: A: SCAN model architecture. The capital letters correspond to four disentangled visual primitives: object identity ($I$), object colour ($O$), floor colour ($F$) and wall colour ($W$). B: Mode coverage of the extra KL term of the SCAN loss function. Forward KL divergence $D_{KL}(\mathbf{z}_x \,\|\, \mathbf{z}_y)$ allows SCAN to learn abstractions (wide yellow distribution $\mathbf{z}_y$) over the visual primitives that are irrelevant to the meaning of a concept (blue modes correspond to the inferred values of $\mathbf{z}_x$ for different visual examples matching symbol $\mathbf{y}$). C: $\beta\text{-VAE}_{DAE}$ model architecture.
  • Figure 3: A: Learning AND, IN COMMON or IGNORE recombination operators with a SCAN model architecture. Inset demonstrates the convolutional recombination operator that takes in $\{\mu_{y_1}^k, \sigma_{y_1}^k; \mu_{y_2}^k, \sigma_{y_2}^k \}$ and outputs $\{\mu_r^k, \sigma_r^k \}$. The capital letters correspond to four disentangled visual primitives: object identity ($I$), object colour ($O$), floor colour ($F$) and wall colour ($W$). B: Visual samples produced by SCAN and JMVAE when instructed with a novel concept recombination. SCAN samples consistently match the expected ground truth recombined concept, while maintaining high variability in the irrelevant visual primitives. JMVAE samples lack accuracy. Recombination instructions are used to imagine concepts that have never been seen during model training. Top: samples for IGNORE; Middle: samples for IN COMMON; Bottom: samples for AND.
  • Figure 4: A: sym2img inferences with "white suitcase", "white suitcase, blue wall", and "white suitcase, blue wall, magenta floor" as input. The latter one points to a concept that the model has never seen during training, either visually or symbolically. All samples are consistently accurate, while showing good diversity in terms of the irrelevant visual attributes. B: when presented with an image, SCAN is able to describe it in terms of all concepts it has learnt, including synonyms (e.g. "dub", which corresponds to {ice lolly, white wall}). The histograms show the distributions of unique concepts the model used to describe each image, most probable of which are printed in descending order next to the corresponding image. The few confusions SCAN makes are intuitive to humans too (e.g. confusing orange and yellow colours).
  • Figure 5: Evolution of understanding of the meaning of concept {cyan wall} as SCAN is exposed to progressively more diverse visual examples. Left: top row contains three sets of visual samples (sym2img) generated by SCAN after seeing each set of five visual examples presented in the bottom row. Right: average inferred specificity of concept latents $z_y^k$ during training. Vertical dashed lines correspond to the vertical dashed lines in the left plot and indicate a switch to the next set of five more diverse visual examples. 6/32 latents $z_y^k$ are shown and labelled according to their corresponding visual primitives in $\mathbf{z}_x$.
  • ...and 15 more figures
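
The mode-covering behaviour described in the Figure 2B caption can be checked numerically. The sketch below uses the standard closed-form KL divergence between univariate Gaussians. The function names `gaussian_kl` and `mean_forward_kl`, and the example means and standard deviations, are illustrative assumptions rather than values from the paper.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for p = N(mu_p, sigma_p^2) and q = N(mu_q, sigma_q^2)."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )


# Hypothetical narrow posteriors z_x for two images that match the same
# symbol but differ in a primitive that is irrelevant to the concept.
z_x_modes = [(-1.0, 0.1), (1.0, 0.1)]


def mean_forward_kl(mu_y, sigma_y):
    """Average forward KL(z_x || z_y) over the example modes."""
    total = sum(gaussian_kl(mu_x, s_x, mu_y, sigma_y) for mu_x, s_x in z_x_modes)
    return total / len(z_x_modes)


# Candidate z_y distributions: one wide and covering both modes,
# one narrow and locked onto a single mode.
wide = mean_forward_kl(0.0, 1.0)
narrow = mean_forward_kl(-1.0, 0.1)

print(f"forward KL, wide z_y:   {wide:.3f}")
print(f"forward KL, narrow z_y: {narrow:.3f}")
```

Under these assumed numbers the wide candidate has a much lower average forward KL than the narrow one, which is the sense in which the forward direction favours a broad $\mathbf{z}_y$ over primitives that vary across examples of the same concept.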