Discovering objects and their relations from entangled scene representations

David Raposo, Adam Santoro, David Barrett, Razvan Pascanu, Timothy Lillicrap, Peter Battaglia

TL;DR

The paper addresses how to learn object relations from entangled scene representations and proposes Relation Networks (RNs), which compute pairwise object relations with a shared function and aggregate them in a permutation-invariant way. The authors show that RNs can learn relational structure from scene descriptions, can induce factorized object representations from entangled inputs (including pixel-based inputs encoded by a VAE), and can support one-shot relational learning when combined with a memory-augmented network. The key contributions are supervised relational reasoning on scene classification tasks, object factorization from entangled inputs, and integration with perceptual and memory modules. The results suggest that RNs are a potentially powerful architecture for problems that require object-relation reasoning.
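
The pairwise computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The layer sizes, the random weights, the choice to sum over ordered pairs with i != j, and the names `init_mlp`, `mlp`, `relation_network`, and `f_params` (a readout network applied after aggregation) are all assumptions made for the example. The last lines shuffle the object order so the effect of permutation-invariant aggregation can be checked directly.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a small MLP. Sizes are hypothetical, not the paper's."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """ReLU MLP with a linear final layer."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def relation_network(objects, g_params, f_params):
    """objects has shape (n_objects, n_features); returns class logits."""
    # Build one input per ordered pair of distinct objects (assumed pairing).
    pairs = [np.concatenate([objects[i], objects[j]])
             for i, j in itertools.permutations(range(len(objects)), 2)]
    # The same function g is applied to every pair.
    relations = mlp(g_params, np.stack(pairs))
    # Summation makes the result independent of object order.
    pooled = relations.sum(axis=0)
    # Assumed readout network applied to the pooled relations.
    return mlp(f_params, pooled)

n_objects, n_features, n_classes = 5, 4, 10
g_params = init_mlp([2 * n_features, 32, 16])
f_params = init_mlp([16, 32, n_classes])

scene = rng.normal(size=(n_objects, n_features))
logits = relation_network(scene, g_params, f_params)

# Present the same objects in a different order.
shuffled = scene[rng.permutation(n_objects)]
logits_shuffled = relation_network(shuffled, g_params, f_params)
```

Comparing `logits` with `logits_shuffled` (for example with `np.allclose`) shows that the output does not depend on the order in which objects are listed. An MLP fed the flattened scene would have no such guarantee.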

Abstract

Our world can be succinctly and compactly described as structured scenes of objects and relations. A typical room, for example, contains salient objects such as tables, chairs and books, and these objects typically relate to each other by their underlying causes and semantics. This gives rise to correlated features, such as position, function and shape. Humans exploit knowledge of objects and their relations for learning a wide spectrum of tasks, and more generally when learning the structure underlying observed data. In this work, we introduce relation networks (RNs) - a general purpose neural network architecture for object-relation reasoning. We show that RNs are capable of learning object relations from scene description data. Furthermore, we show that RNs can act as a bottleneck that induces the factorization of objects from entangled scene description inputs, and from distributed deep representations of scene images provided by a variational autoencoder. The model can also be used in conjunction with differentiable memory mechanisms for implicit relation discovery in one-shot learning tasks. Our results suggest that relation networks are a potentially powerful architecture for solving a variety of problems that require object relation reasoning.

Paper Structure

This paper contains 17 sections, 1 equation, 11 figures.

Figures (11)

  • Figure 1: Model types. RNs are constructed to operate with an explicit prior on the input space (c). Features from all pairwise combinations of objects act as input to the same MLP, $g_{\psi}$.
  • Figure 2: Objects and relations. Relation types (column one) between object types (column two) can be described with directed graphs (column three). Shown in the fourth column are cropped clusters from example scenes generated by a model based on the directed graph shown in column three. In the last column are examples of relations that can be used to inform class membership; for example, the distances between pairs of objects, or the differences in color between pairs of objects, may inform the particular graphical structure, and hence generative model, used to generate the scene.
  • Figure 3: Scene entangling. To test the ability of the RN to operate on entangled scene representations, we (a) multiplied a flattened vector representation of the scene description by a fixed permutation matrix $B$, or (b) passed pixel-level representations of the scenes through a VAE and used the latent code as input to an RN with an additional linear layer.
  • Figure 4: Scene classification tasks. (a) RNs of various sizes (legend, inset) performed well when trained to classify 10 scene classes based on position relations, reaching a cross-entropy loss below 0.01 (top panel), and on tasks that contained 5, 10 or 20 classes (bottom panel). The MLPs performed poorly regardless of network size and the number of classes. (b) When relational structure depended on the color of the objects (color task), all RN configurations performed well when classifying 5, 10 or 20 classes, similar to what we observed on the position task. MLPs with a similar number of parameters performed poorly.
  • Figure 5: Scene classification on withheld classes. RNs of different sizes (legend, inset) were trained to classify scenes from a pool of 490 classes. The plot shows the cross entropy loss on a test set composed of samples from 10 previously unseen classes.
  • ...and 6 more figures
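
Figure 3(a) describes entangling a scene by multiplying its flattened description by a fixed permutation matrix $B$. The sketch below shows, under assumed sizes and a randomly drawn permutation, what that operation does: the features of each object are scattered across the vector, yet no information is lost because the transpose of $B$ undoes the shuffle. All variable names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scene description: one row of features per object.
n_objects, n_features = 4, 3
scene = rng.normal(size=(n_objects, n_features))

# Flatten so that each object's features sit in one contiguous block.
flat = scene.reshape(-1)
d = flat.size

# Fixed permutation matrix B: the identity with its rows shuffled once.
B = np.eye(d)[rng.permutation(d)]

# Entangled input: the same values, but object boundaries are no longer aligned.
entangled = B @ flat

# A permutation matrix is orthogonal, so its transpose inverts it.
recovered = B.T @ entangled
```

Because `B` is fixed across all scenes, a learned linear layer in front of the relational module could in principle undo it, which is the kind of factorization the entangled-input experiments probe.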