
ORGAN: Object-Centric Representation Learning using Cycle Consistent Generative Adversarial Networks

Joël Küchler, Ellen van Maren, Vaiva Vasiliauskaitė, Katarina Vulić, Reza Abbasi-Asl, Stephan J. Ihle

TL;DR

This work presents ORGAN, a novel approach to object-centric representation learning based on cycle-consistent Generative Adversarial Networks rather than the autoencoder architectures that dominate the field. It performs similarly to other state-of-the-art approaches on synthetic datasets and is the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast.

Abstract

Although data generation is often straightforward, extracting information from data is more difficult. Object-centric representation learning can extract information from images in an unsupervised manner. It does so by segmenting an image into its subcomponents: the objects. Each object is then represented in a low-dimensional latent space that can be used for downstream processing. Object-centric representation learning is dominated by autoencoder architectures (AEs). Here, we present ORGAN, a novel approach for object-centric representation learning, which is based on cycle-consistent Generative Adversarial Networks instead. We show that it performs similarly to other state-of-the-art approaches on synthetic datasets, while at the same time being the only approach tested here capable of handling more challenging real-world datasets with many objects and low visual contrast. Complementing these results, ORGAN creates expressive latent space representations that allow for object manipulation. Finally, we show that ORGAN scales well both with respect to the number of objects and the size of the images, giving it a unique edge over current state-of-the-art approaches.

Paper Structure (21 sections, 4 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: Model architecture and training losses. ORGAN is based on the Cycle-Consistent Generative Adversarial Network (cycleGAN) architecture zhu2017unpaired, adapted to use an image (left) and a list (right) domain. Two GANs are used to transform between the two domains. By cycling through both generators, the cycle-consistency losses $\mathcal{L}_{Cyc_L}$ and $\mathcal{L}_{Cyc_I}$ can be enforced.
  • Figure 2: Network architectures used in this work. (A) List generator architecture, which transforms images into lists. First, the image is divided into patches, and a score function predicts for each patch how likely it is to contain an object. After non-maximum suppression, a differentiable top-k operator selects the k patches that contain objects. The feature network then extracts each object's features. (B) Image generator architecture, which transforms lists into images. First, the list is transformed into an image based on each element's location and features. For that, each feature vector $\eta_i$ is projected ($\zeta$) onto a sphere, where the length of the projected vectors encodes $\alpha$. Then, an approach similar to BlobGAN epstein2022blobgan is used to convert the projections into an image. After adding a noise channel to allow for richer backgrounds, a U-net transforms the style of the image.
  • Figure 3: Examples of object detection performance for four different datasets. We show performance for ORGAN (ours), SPACE lin2020space, SLATE singh2021illiterate, LSD jiang2023lsd, and SPOT kakogeorgiou2024spot. For ORGAN and SPACE, we present both the reconstruction and the object detection in the form of fixed bounding boxes. The color code of the bounding box represents the extracted feature vector. For SLATE, LSD, and SPOT, we present the recovered slot masks (color-coded). SPOT does not allow for image reconstruction.
  • Figure 4: Latent space quality of ORGAN. (A) The latent space as encoded by $\eta$ is presented for the Sprites dataset. Two of the three dimensions are plotted, while the third dimension is kept fixed at its center. The feature space encodes the color, shape, and size of the object. (B) An input image of the Tetrominoes dataset is cycled, while the list elements are modified. Top: All objects were moved closer to the center. Bottom: The object properties were swapped between the three objects counterclockwise. (C) To compare how well object clustering works on the feature space, the Davies–Bouldin Index davies1979cluster was calculated for the two tested approaches that detected distinct objects on Sprites and Tetrominoes: ORGAN and SPACE. Our method achieves superior separability.
  • Figure 5: Analysis of the generalization capabilities of the object detection accuracy. (A) Recall and precision are plotted for both ORGAN and SPACE for Sprites. The architecture of SPACE can only be applied to a fixed input size (here $256 \times 256$). ORGAN was trained on $128 \times 128$ and applied to $256 \times 256$. This setup benefits SPACE. Yet, ORGAN achieves higher precision, while SPACE has a higher recall. (B) ORGAN can also generalize to much larger images. Trained on patches of size $128 \times 128$, prediction results on a $768 \times 768$ image of the Cells dataset are presented. The color code of the bounding box illustrates differences in extracted features $\eta$.
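
The patch selection step described for the list generator in Figure 2 (A) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `select_object_patches`, the 3 x 3 suppression window, and the toy score map are assumptions made for illustration, and a hard (non-differentiable) top-k stands in for the differentiable top-k operator used in the paper.

```python
import numpy as np

def select_object_patches(scores, k, window=3):
    """Return up to k (row, col) patch indices after non-maximum suppression.

    scores: 2D array with one objectness score per image patch (hypothetical input).
    A patch survives suppression only if it is the maximum of its local window.
    """
    scores = np.asarray(scores, dtype=float)
    h, w = scores.shape
    pad = window // 2
    padded = np.pad(scores, pad, mode="constant", constant_values=-np.inf)

    # Running maximum over every shift of the window x window neighbourhood.
    local_max = np.full_like(scores, -np.inf)
    for dy in range(window):
        for dx in range(window):
            local_max = np.maximum(local_max, padded[dy:dy + h, dx:dx + w])

    # Non-maximum suppression: discard patches dominated by a neighbour.
    suppressed = np.where(scores >= local_max, scores, -np.inf)

    # Hard top-k (assumption: the paper uses a differentiable operator instead).
    order = np.argsort(suppressed, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, scores.shape)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if np.isfinite(suppressed[r, c])]

# Toy 4 x 4 score map: the 0.8 patch sits next to the 0.9 peak and is suppressed.
toy_scores = np.full((4, 4), 0.1)
toy_scores[0, 0] = 0.9
toy_scores[0, 1] = 0.8
toy_scores[3, 3] = 0.7
selected = select_object_patches(toy_scores, k=2)
```

In this toy case the two selected patches are the isolated peaks, while the neighbouring high score is dropped, which is the behaviour non-maximum suppression is meant to provide before the top-k step.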
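
For readers unfamiliar with the Davies–Bouldin Index used in Figure 4 (C), the following is a minimal sketch of the standard definition, in which lower values indicate tighter and better separated clusters. The function name and the toy feature vectors are hypothetical; how the paper assigns cluster labels to the extracted features $\eta$ is not stated here and is not assumed.

```python
import numpy as np

def davies_bouldin_index(features, labels):
    """Davies-Bouldin index of a labelled point set (lower is better)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    if len(cluster_ids) < 2:
        raise ValueError("at least two clusters are required")

    # Centroid of each cluster.
    centroids = np.stack([features[labels == c].mean(axis=0) for c in cluster_ids])

    # Scatter: mean Euclidean distance of cluster members to their centroid.
    scatter = np.array([
        np.linalg.norm(features[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(cluster_ids)
    ])

    # Pairwise centroid distances; the diagonal is set to inf so that a
    # cluster is never compared with itself.
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)

    # For each cluster, take the worst-case similarity to any other cluster.
    ratios = (scatter[:, None] + scatter[None, :]) / dist
    return float(ratios.max(axis=1).mean())

# Toy example with two well-separated clusters (hypothetical feature vectors).
toy_features = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
toy_labels = np.array([0, 0, 1, 1])
toy_dbi = davies_bouldin_index(toy_features, toy_labels)
```

Here each cluster has a scatter of 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2. scikit-learn's `davies_bouldin_score` computes the same quantity and may be preferable in practice.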