Object-Centric Relational Representations for Image Generation
Luca Butera, Andrea Cini, Alberto Ferrante, Cesare Alippi
TL;DR
This work presents GraPhOSE, a framework that conditions image generation on attributed pose graphs. It encodes object structure and semantics into a graph-based representation and produces a learned layout mask for a downstream decoder. The framework combines a graph-based encoder with a mask generator; it is pre-trained on a surrogate task over procedurally generated graphs, which regularizes learning, and then fine-tuned end-to-end on the target task. A novel synthetic benchmark, Pose-Representable Objects (PRO), and a real-world Humans dataset are used to show that relational, object-centric conditioning improves generation quality (lower FID, higher SSIM) and enables manipulation of the output by editing the graph. The approach offers flexible, scalable conditioning and regularization for generative models, with potential for broader application in structured scene synthesis and controllable generation.
Abstract
Conditioning image generation on specific features of the desired output is a key ingredient of modern generative models. However, existing approaches lack a general and unified way of representing structural and semantic conditioning at different levels of granularity. This paper explores a novel method to condition image generation, based on object-centric relational representations. In particular, we propose a methodology to condition the generation of objects in an image on the attributed graph representing their structure and the associated semantic information. We show that such architectural biases entail properties that facilitate the manipulation and conditioning of the generative process and allow for regularizing the training procedure. The proposed conditioning framework is implemented by means of a neural network that learns to generate a 2D, multi-channel layout mask of the objects, which can be used as a soft inductive bias in the downstream generative task. To do so, we leverage both 2D and graph convolutional operators. We also propose a novel benchmark for image generation consisting of a synthetic dataset of images paired with their relational representation. Empirical results show that the proposed approach compares favorably against relevant baselines.
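To make the idea of an attributed graph mapped to a multi-channel layout mask more concrete, the following is a minimal illustrative sketch in plain Python. It is not the network described in this paper: the `PoseGraph` class, the `mean_aggregate` and `rasterize_layout` functions, and all parameter values are hypothetical names chosen for illustration. In particular, the mask here is built from fixed Gaussian blobs, a hand-coded stand-in for the learned mask generator, and the aggregation step is a parameter-free neighbor average rather than a trained graph convolution.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PoseGraph:
    """Hypothetical attributed pose graph (illustrative only)."""
    # Node positions in normalized [0, 1] image coordinates, as (x, y).
    positions: List[Tuple[float, float]]
    # One integer semantic label per node (e.g. a keypoint type).
    labels: List[int]
    # Undirected edges as pairs of node indices.
    edges: List[Tuple[int, int]] = field(default_factory=list)

    def neighbors(self, i: int) -> List[int]:
        out = []
        for a, b in self.edges:
            if a == i:
                out.append(b)
            elif b == i:
                out.append(a)
        return out


def mean_aggregate(graph: PoseGraph, features: List[List[float]]) -> List[List[float]]:
    """One parameter-free message-passing step: each node's new feature is
    the average of its own feature and its neighbors' features."""
    result = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in graph.neighbors(i)]
        result.append([sum(g[d] for g in group) / len(group) for d in range(len(feat))])
    return result


def rasterize_layout(graph: PoseGraph, num_classes: int, size: int = 16,
                     sigma: float = 0.08) -> List[List[List[float]]]:
    """Assumed stand-in for a learned mask generator: draws one Gaussian blob
    per node into the channel given by the node's label.
    Returns a nested list indexed as mask[channel][row][col]."""
    mask = [[[0.0] * size for _ in range(size)] for _ in range(num_classes)]
    for (x, y), label in zip(graph.positions, graph.labels):
        for r in range(size):
            for c in range(size):
                px, py = (c + 0.5) / size, (r + 0.5) / size  # pixel center
                d2 = (px - x) ** 2 + (py - y) ** 2
                value = math.exp(-d2 / (2 * sigma ** 2))
                mask[label][r][c] = max(mask[label][r][c], value)
    return mask


# Usage: a two-node graph with one edge and one-hot node attributes.
graph = PoseGraph(positions=[(0.25, 0.25), (0.75, 0.75)], labels=[0, 1], edges=[(0, 1)])
smoothed = mean_aggregate(graph, [[1.0, 0.0], [0.0, 1.0]])
mask = rasterize_layout(graph, num_classes=2)
```

Editing the graph (for example, moving a node's position or changing its label) changes the resulting mask directly, which is the kind of manipulation that a graph-based conditioning signal is meant to allow. In a learned setting, both the aggregation and the rasterization would be replaced by trainable graph and 2D convolutional layers.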
