Generating Compositional Scenes via Text-to-image RGBA Instance Generation

Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot

TL;DR

This work proposes a multi-stage generation paradigm designed for fine-grained control, flexibility and interactivity. The approach makes it possible to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.

Abstract

Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations, but generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space and scene manipulation abilities. In this work, we propose a novel multi-stage generation paradigm that is designed for fine-grained control, flexibility and interactivity. To ensure control over instance attributes, we devise a novel training paradigm to adapt a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles components in realistic scenes. Our experiments show that our RGBA diffusion model is capable of generating diverse and high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach makes it possible to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
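
To make the role of the transparency channel concrete, the sketch below layers pre-generated RGBA instances onto a background with the standard Porter-Duff "over" operator. It is not the paper's multi-layer composite generation process, which is a learned procedure not specified in this summary; it only illustrates, under simplifying assumptions, why isolated RGBA instances can be placed, reordered and occluded independently. The names `Pixel`, `Layer`, `over`, `paste` and `compose` are hypothetical, and pixels are assumed to be straight (non-premultiplied) RGBA floats in [0, 1].

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float, float]  # (R, G, B, A), each in [0, 1]
Layer = List[List[Pixel]]                  # rows of pixels


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over' for straight (non-premultiplied) alpha."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels are fully transparent, so colour is undefined.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def paste(canvas: Layer, instance: Layer, top: int, left: int) -> Layer:
    """Return a copy of canvas with instance composited at (top, left)."""
    out = [row[:] for row in canvas]
    for i, row in enumerate(instance):
        for j, px in enumerate(row):
            y, x = top + i, left + j
            # Pixels that fall outside the canvas are clipped.
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = over(px, out[y][x])
    return out


def compose(background: Layer, placed: List[Tuple[Layer, int, int]]) -> Layer:
    """Composite instances back to front; later entries occlude earlier ones."""
    canvas = background
    for instance, top, left in placed:
        canvas = paste(canvas, instance, top, left)
    return canvas


# A 2x2 opaque blue background and two 1x1 instances (illustrative data only).
blue = (0.0, 0.0, 1.0, 1.0)
background = [[blue, blue], [blue, blue]]
opaque_red = [[(1.0, 0.0, 0.0, 1.0)]]
faint_red = [[(1.0, 0.0, 0.0, 0.5)]]

scene = compose(background, [(opaque_red, 0, 1), (faint_red, 1, 0)])
```

In this toy setting, moving an instance means changing its `(top, left)` offset, and changing its depth means changing its position in the `placed` list. Neither operation requires regenerating the instance itself.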

Paper Structure

This paper contains 26 sections, 3 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Overview of key components of our proposed methodology.
  • Figure 2: Our model can generalise to different styles and follow detailed instructions. Top row: 'a cartoon style frog', 'a digital artwork of an anime-style character with long, flowing white hair and large and expressive purple eyes in a white attire', 'a stylised character with a traditional Asian hat, with a red and green pattern', 'a man with a contemplative expression and a neatly trimmed beard'. Bottom row: 'a woman with a classic, vintage style, curly hair, red lipstick, fair skin in a dark attire', 'a bird mid-flight with brown and white feathers and orange head', 'a hand-painted ceramic vase in blue and yellow colours and with a floral pattern', 'a woman with short, blonde hair, vivid green eyes, in a white blouse, with a gold necklace featuring a pendant with a gemstone'.
  • Figure 3: Instances generated with the captions: 'a majestic brown bear with dark brown fur, its head slightly tilted to the left and its mouth slightly open', 'an Impressionist portrait of a woman', 'a portrait of a young man, depicted in a blend of blue and red tones'.
  • Figure 4: Our proposed training and sampling approaches (c) improve results obtained with standard training (a) and standard sampling (b).
  • Figure 5: Visual examples of scene composition results. RGBA instances are highlighted in bold.
  • ...and 9 more figures