Object-Centric Relational Representations for Image Generation

Luca Butera, Andrea Cini, Alberto Ferrante, Cesare Alippi

TL;DR

This work presents GraPhOSE, a framework that conditions image generation on attributed pose graphs: it encodes object structure and semantics in a graph-based representation and produces a learned layout mask for a downstream decoder. It combines a graph-based encoder with a mask generator that is pre-trained on surrogate masks of procedurally generated graphs, which regularizes learning, and is then fine-tuned end-to-end on the target tasks. A novel synthetic benchmark, Pose-Representable Objects (PRO), and a real-world Humans dataset are used to show that relational, object-centric conditioning improves generation quality (lower FID, higher SSIM) and enables manipulation of the output by editing the graph. The approach offers flexible, scalable conditioning and regularization for generative models, with potential for broader application in structured scene synthesis and controllable generation.

Abstract

Conditioning image generation on specific features of the desired output is a key ingredient of modern generative models. However, existing approaches lack a general and unified way of representing structural and semantic conditioning at diverse granularity levels. This paper explores a novel method to condition image generation, based on object-centric relational representations. In particular, we propose a methodology to condition the generation of objects in an image on the attributed graph representing their structure and the associated semantic information. We show that such architectural biases entail properties that facilitate the manipulation and conditioning of the generative process and allow for regularizing the training procedure. The proposed conditioning framework is implemented by means of a neural network that learns to generate a 2D, multi-channel, layout mask of the objects, which can be used as a soft inductive bias in the downstream generative task. To do so, we leverage both 2D and graph convolutional operators. We also propose a novel benchmark for image generation consisting of a synthetic dataset of images paired with their relational representation. Empirical results show that the proposed approach compares favorably against relevant baselines.
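
To make the data layout concrete, the following is a minimal sketch of how an attributed pose graph can be turned into a 2D, multi-channel layout mask with one channel per node. It is not the authors' method: GraPhOSE learns its masks with a neural network built from graph and 2D convolutional operators, whereas this example rasterizes hand-crafted Gaussian falloffs around each node and its incident edges. The function names, the `size` and `sigma` parameters, the three-node graph, and the max-based aggregation are all assumptions made for illustration.

```python
import numpy as np


def segment_distance(px, py, a, b):
    """Distance from every pixel (px, py) to the line segment a-b."""
    ab = b - a
    denom = float(ab @ ab)
    if denom == 0.0:
        # Zero-length edge: the segment degenerates to the point a.
        t = np.zeros_like(px)
    else:
        # Projection parameter of each pixel, clipped to stay on the segment.
        t = np.clip(((px - a[0]) * ab[0] + (py - a[1]) * ab[1]) / denom, 0.0, 1.0)
    return np.hypot(px - (a[0] + t * ab[0]), py - (a[1] + t * ab[1]))


def layout_mask(positions, edges, size=32, sigma=0.05):
    """Hand-crafted (hypothetical) layout mask, one channel per node.

    positions: (N, 2) node coordinates (x, y), normalized to [0, 1].
    edges: list of (u, v) node-index pairs.
    Returns an array of shape (N, size, size) with values in [0, 1].
    """
    positions = np.asarray(positions, dtype=float)
    # Pixel centres expressed in the same normalized coordinate space.
    coords = (np.arange(size) + 0.5) / size
    px, py = np.meshgrid(coords, coords)  # rows index y, columns index x
    channels = []
    for i, p in enumerate(positions):
        # Start from the distance to the node itself ...
        dist = np.hypot(px - p[0], py - p[1])
        # ... and also cover every edge that touches this node.
        for u, v in edges:
            if i in (u, v):
                dist = np.minimum(
                    dist, segment_distance(px, py, positions[u], positions[v])
                )
        channels.append(np.exp(-dist**2 / (2 * sigma**2)))
    return np.stack(channels)


# Hypothetical three-node chain, e.g. head - torso - hips.
positions = [[0.5, 0.2], [0.5, 0.5], [0.5, 0.8]]
edges = [(0, 1), (1, 2)]

mask = layout_mask(positions, edges)  # per-node channels
aggregated = mask.max(axis=0)         # single-channel overview of the layout
```

Editing the graph (moving a node, or adding or removing an edge) changes only the affected channels, which is the kind of local control that a graph-based conditioning signal is meant to expose. In the actual framework the mask is produced by a trained network rather than a fixed formula.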

Paper Structure

This paper contains 30 sections, 13 equations, 15 figures, and 12 tables.

Figures (15)

  • Figure 1: Our pipeline, with GraPhOSE in grey. $\mu_{\theta}$ gets pre-trained on surrogate masks. The downstream model, in yellow, can be any trainable generative model that accepts a $3$-d tensor as conditioning input. The whole pipeline can be trained end-to-end.
  • Figure 2: Surrogate mask for a random graph (left) and for a graph representing a person (right). Node positions in graph space are normalized between $0$ and $1$.
  • Figure 3: Sample of generated masks. For each one, the large figure is the aggregated mask, while the small ones are the masks associated with each node. The blue dots mark the position of the node each small mask corresponds to. (a) is a random graph like those used for pre-training; (b) and (c) are simple handcrafted ones.
  • Figure 4: Masks generated after pre-training on random graphs (a) and on graphs from the Humans task (b). For each group, the first two columns are samples from the Humans task and the last two are random graphs. In (b), performance clearly degrades out of distribution.
  • Figure 5: Sample results for PRO (left) and Humans (right) tasks. Row (a) is the input graph. Generated samples come from: the PRO exact renderer (b - left), the unconditioned downstream model (b - right), the FNN-based baseline (c), the GNN-based one (d), GraPhOSE without pre-training (e), GraPhOSE (f).
  • ...and 10 more figures