Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions

Sjoerd van Steenkiste, Michael Chang, Klaus Greff, Jürgen Schmidhuber

TL;DR

The paper tackles unsupervised learning of object-centered representations and their interactions from visual data. It introduces Neural Expectation Maximization (N-EM) to partition scenes into object components and Relational N-EM (R-NEM) to model pairwise interactions via an attention-weighted relational function. Across bouncing-ball experiments, occlusion, and Space Invaders, R-NEM outperforms non-relational baselines in both predictive accuracy and object-structure metrics, and demonstrates robust generalization to scenes with more objects. The work provides a step toward human-like, unsupervised world models that reason about objects and their dynamics, while acknowledging limitations and outlining directions for integrating top-down guidance and reinforcement learning.

Abstract

Common-sense physical reasoning is an essential ingredient for any intelligent agent operating in the real world. For example, it can be used to simulate the environment, or to infer the state of parts of the world that are currently unobserved. In order to match real-world conditions this causal knowledge must be learned without access to supervised data. To address this problem we present a novel method that learns to discover objects and model their physical interactions from raw visual images in a purely unsupervised fashion. It incorporates prior knowledge about the compositional nature of human perception to factor interactions between object-pairs and learn efficiently. On videos of bouncing balls we show the superior modelling capabilities of our method compared to other unsupervised neural approaches that do not incorporate such prior knowledge. We demonstrate its ability to handle occlusion and show that it can extrapolate learned knowledge to scenes with different numbers of objects.

Paper Structure

This paper contains 21 sections, 5 equations, 6 figures.

Figures (6)

  • Figure 1: Illustration of the different computational aspects of R-NEM when applied to a sequence of images of bouncing balls. Note that $\bm{\gamma}, \bm{\psi}$ at the Representations level correspond to the $\bm{\gamma}$ (E-step), $\bm{\psi}$ (Group Reconstructions) from the previous time-step. Different colors correspond to different cluster components (object representations). The right side shows a computational overview of $\Upsilon^{\text{R-NEM}}$, a function that computes the pair-wise interactions between the object representations.
  • Figure 2: R-NEM applied to a sequence of $4$ bouncing balls. Each column corresponds to a time-step, which coincides with an EM step. At each time-step, R-NEM computes $K=5$ new representations $\bm{\theta}_{k}$ according to the recurrent update equation (see also Representations in Figure 1) from the input $\bm{x}$ with added noise (bottom row). From each new $\bm{\theta}_{k}$ a group reconstruction $\bm{\psi}_{k}$ is produced (rows 2-6 from bottom) that predicts the state of the environment at the next time-step. Attention coefficients are visualized by overlaying a colored reconstruction of a context object on the white reconstruction of the focus object (see the Attention paragraph). Based on the prediction accuracy of $\bm{\psi}$, the E-step (see Figure 1) computes new soft-assignments $\bm{\gamma}$ (row 7 from bottom), visualized by coloring each pixel $i$ according to its distribution over components $\bm{\gamma}_{i}$. Row 8 visualizes the total prediction by the network ($\sum_{k}\bm{\psi}_{k} \cdot \bm{\gamma}_{k}$) and row 9 the ground-truth sequence at the next time-step.
  • Figure 3: Performance of each method on the bouncing balls task. Each method was trained on a dataset with 4 balls, evaluated on a test set with $4$ balls (left), and on a test set with 6-8 balls (middle). The losses are reported relative to the loss of a baseline for each dataset that always predicts the current frame. The ARI score (right) is used to evaluate the degree of compositionality that is achieved.
  • Figure 4: Left: Three sequences of 15 time-steps: ground-truth (top), R-NEM (middle), RNN (bottom). The last ten time-steps of the sequences produced by R-NEM and RNN are simulated. Right: The BCE loss on the entire test set for these same time-steps.
  • Figure 5: R-NEM applied to a sequence of bouncing balls with an invisible curtain. The ground truth sequence is displayed in the top row, followed by the prediction of R-NEM (middle) and the soft-assignments of pixels to components (bottom). R-NEM models objects, as well as their interactions, even when an object is completely occluded (step 36). Only a subset of the steps is shown.
  • ...and 1 more figure
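
The Figure 2 and Figure 3 captions describe three small computations: an E-step that turns the group reconstructions $\bm{\psi}$ into soft-assignments $\bm{\gamma}$, the total prediction $\sum_{k}\bm{\psi}_{k} \cdot \bm{\gamma}_{k}$, and a loss reported relative to a baseline that always predicts the current frame. The sketch below is one possible reading of those captions and is not the authors' implementation. It assumes binary pixels, a Bernoulli pixel likelihood, a uniform prior over components, and that "relative" means a simple ratio of BCE losses. The function names and the clipping constant `EPS` are hypothetical.

```python
import math

EPS = 1e-6  # hypothetical clipping constant, keeps logs and ratios finite


def e_step(x, psi):
    """Soft-assign each of N pixels to K components.

    x:   list of N pixel values in {0, 1}
    psi: K lists of N per-component predictions in [0, 1]
    Assumes a Bernoulli likelihood and a uniform prior over components.
    Returns gamma as K lists of N values that sum to 1 for each pixel.
    """
    K, N = len(psi), len(x)
    gamma = [[0.0] * N for _ in range(K)]
    for i in range(N):
        lik = []
        for k in range(K):
            p = min(max(psi[k][i], EPS), 1.0 - EPS)
            lik.append(p if x[i] == 1 else 1.0 - p)
        total = sum(lik)
        for k in range(K):
            gamma[k][i] = lik[k] / total
    return gamma


def total_prediction(psi, gamma):
    """Per-pixel combined prediction: sum over k of psi_k * gamma_k."""
    K, N = len(psi), len(psi[0])
    return [sum(psi[k][i] * gamma[k][i] for k in range(K)) for i in range(N)]


def bce(pred, target):
    """Mean binary cross-entropy, with predictions clipped to (0, 1)."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(target)


def relative_loss(pred, current_frame, next_frame):
    """Model BCE divided by the BCE of predicting the current frame."""
    return bce(pred, next_frame) / bce(current_frame, next_frame)


if __name__ == "__main__":
    x = [1, 0, 1, 0]                      # toy 4-pixel frame
    psi = [[0.9, 0.1, 0.2, 0.1],          # component 0
           [0.1, 0.1, 0.8, 0.1]]          # component 1
    gamma = e_step(x, psi)
    print(gamma)
    print(total_prediction(psi, gamma))
    print(relative_loss([0.9, 0.1], [0, 1], [1, 0]))
```

Under the ratio assumption, a relative loss below 1 means the model predicts the next frame better than copying the current one. Pixels that every component explains equally well, such as background, receive a uniform soft-assignment.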