Tagger: Deep Unsupervised Perceptual Grouping

Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, Jürgen Schmidhuber, Harri Valpola

TL;DR

TAG addresses perceptual grouping in multi-object inputs by learning groupings and representations either unsupervised or alongside a supervised task. The iTerative Amortized Grouping (TAG) framework partitions inputs into K groups and uses a shared parametric mapping to iteratively refine group assignments and object representations, with a denoising objective enabling amortized inference. Tagger combines TAG with a Ladder network, which enables efficient inference and improves performance on the synthetic Shapes and textured MNIST datasets, including substantial gains in semi-supervised learning. The work demonstrates fast convergence and domain-agnostic applicability, and suggests potential for scaling to more complex multi-object scenarios.

Abstract

We present a framework for efficient perceptual inference that explicitly reasons about the segmentation of its inputs and features. Rather than being trained for any specific segmentation, our framework learns the grouping process in an unsupervised manner or alongside any supervised task. By enriching the representations of a neural network, we enable it to group the representations of different objects in an iterative manner. By allowing the system to amortize the iterative inference of the groupings, we achieve very fast convergence. In contrast to many other recently proposed methods for addressing multi-object scenes, our system does not assume the inputs to be images and can therefore directly handle other modalities. For multi-digit classification of very cluttered images that require texture segmentation, our method offers improved classification performance over convolutional networks despite being fully connected. Furthermore, we observe that our system greatly improves on the semi-supervised result of a baseline Ladder network on our dataset, indicating that segmentation can also improve sample efficiency.

Paper Structure

This paper contains 36 sections, 25 equations, 6 figures, 2 tables, 3 algorithms.

Figures (6)

  • Figure 1: An example of perceptual grouping for vision.
  • Figure 2: Illustration of the TAG framework used for training. Left: The system learns by denoising its input over iterations, using several groups to distribute the representation. Each group, represented by several panels of the same color, maintains its own estimate of the reconstruction $\bm{z}^i$ of the input and a corresponding mask $\bm{m}^i$, which encodes the parts of the input that this group is responsible for representing. These estimates are updated over iterations by the same network; that is, all groups and iterations share the network weights, and only the inputs to the network differ. In the case of images, $\bm{z}$ contains pixel values. Right: In each iteration, $\bm{z}^{i-1}$ and $\bm{m}^{i-1}$ from the previous iteration are used to compute a likelihood term $L(\bm{m}^{i-1})$ and a modeling error $\delta \bm{z}^{i-1}$. These four quantities are fed to the parametric mapping to produce $\bm{z}^i$ and $\bm{m}^i$ for the next iteration. During learning, all inputs to the network are derived from the corrupted input, as shown here. The unsupervised task for the network is to learn to denoise, i.e. to output an estimate $q(\bm{x})$ of the original clean input. See Section \ref{sec:synthesis} for more details.
  • Figure 3: Pseudocode for running Tagger on a single real-valued example $\bm{x}$. For details and a binary-input version please refer to supplementary material.
  • Figure 4: Results for the Shapes dataset. Left column: 7 examples from the test set along with their resulting groupings in descending AMI score order, and 3 hand-picked examples (A, B, and C) to demonstrate generalization. A: Testing a 2-group model on 3-object data. B: Testing a 4-group model trained with 3-object data on 4 objects. C: Testing a 4-group model trained with 3-object data on 2 objects. Right column: Illustration of the inference process over iterations for four color-coded groups; $\bm{m}_g$ and $\bm{z}_g$.
  • Figure 5: Results for the TextureMNIST2 dataset. Left column: 7 examples from the test set along with their resulting groupings in descending AMI score order, and 3 hand-picked examples (D, E1, E2). D: An example from the TextureMNIST1 dataset. E1-2: A hand-picked example from TextureMNIST2. E1 demonstrates typical inference, and E2 demonstrates how the system is able to estimate the input when a certain group (topmost digit 4) is removed. Right column: Illustration of the inference process over iterations for four color-coded groups; $\bm{m}_g$ and $\bm{z}_g$.
  • ...and 1 more figure
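
The iteration described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `placeholder_mapping` is a hypothetical stand-in for the learned network (Tagger uses a Ladder network there), the per-group likelihood is assumed to be Gaussian with a fixed variance `var`, and the initialization, the exact form of the modeling error, and all function names are choices made for this sketch only.

```python
import numpy as np


def softmax(a, axis=0):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def placeholder_mapping(z, m, likelihood, delta_z, step=0.5):
    """Hypothetical stand-in for the learned parametric mapping.

    The same function is applied to every group at every iteration,
    mirroring the weight sharing described in the caption. A real
    model would be a trained network; this hand-written rule only
    shows the data flow.
    """
    z_new = z + step * delta_z                 # move z toward the input
    m_logits = np.log(likelihood + 1e-12)      # follow the likelihood term
    return z_new, m_logits


def tag_inference(x_corrupted, num_groups=4, num_iters=5, var=0.1, seed=0):
    """Run the grouping loop on one real-valued input vector."""
    rng = np.random.default_rng(seed)
    d = x_corrupted.shape[0]

    # Assumed initialization: z at the input mean, m as a random softmax.
    z = np.full((num_groups, d), x_corrupted.mean())
    m = softmax(rng.normal(size=(num_groups, d)), axis=0)

    for _ in range(num_iters):
        # Unnormalized Gaussian likelihood of the corrupted input per group.
        lik = np.exp(-((x_corrupted - z) ** 2) / (2.0 * var))
        post = m * lik
        # Likelihood term L(m): share of each element explained by each group.
        likelihood = post / (post.sum(axis=0, keepdims=True) + 1e-12)
        # Modeling error delta z: mask-weighted residual (assumed form).
        delta_z = m * (x_corrupted - z)
        # The four quantities z, m, L(m), delta z feed the shared mapping.
        z, m_logits = placeholder_mapping(z, m, likelihood, delta_z)
        m = softmax(m_logits, axis=0)          # masks sum to 1 over groups

    x_hat = (m * z).sum(axis=0)                # mixture mean as the estimate
    return z, m, x_hat


if __name__ == "__main__":
    clean = np.concatenate([np.zeros(8), np.ones(8)])
    noisy = clean + 0.1 * np.random.default_rng(1).normal(size=16)
    z, m, x_hat = tag_inference(noisy, num_groups=3, num_iters=4)
    print(z.shape, m.shape, x_hat.shape)
```

During training, the estimate would be compared against the clean input to form the denoising cost, and the gradient would flow through all iterations into the shared mapping; the sketch omits training entirely and only shows how the per-group quantities are passed from one iteration to the next.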