Generating Intermediate Representations for Compositional Text-To-Image Generation

Ran Galun, Sagie Benaim

TL;DR

This work proposes a two-stage compositional approach for text-to-image generation. Compared with the standard non-compositional baseline, it yields a notable improvement in FID score and a comparable CLIP score.

Abstract

Text-to-image diffusion models have demonstrated an impressive ability to produce high-quality outputs. However, they often struggle to accurately follow fine-grained spatial information in an input text. To this end, we propose a compositional approach for text-to-image generation based on two stages. In the first stage, we design a diffusion-based generative model to produce one or more aligned intermediate representations (such as depth or segmentation maps) conditioned on text. In the second stage, we map these representations, together with the text, to the final output image using a separate diffusion-based generative model. Our findings indicate that such a compositional approach can improve image generation, resulting in a notable improvement in FID score and a comparable CLIP score when compared to the standard non-compositional baseline.
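
The data flow of the two stages can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the class `TwoStagePipeline`, its fields, and the two stub functions are hypothetical names, and strings stand in for the image tensors that real diffusion models would produce. The one point it shows is that the text conditions both stages, while the second stage additionally receives the intermediate maps.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in type: a real system would pass image tensors, not strings.
Maps = Dict[str, str]


@dataclass
class TwoStagePipeline:
    # Stage 1 (assumed interface): text -> named intermediate maps.
    text_to_maps: Callable[[str], Maps]
    # Stage 2 (assumed interface): (text, maps) -> final image.
    maps_to_image: Callable[[str, Maps], str]

    def generate(self, prompt: str) -> str:
        maps = self.text_to_maps(prompt)
        if not maps:
            raise ValueError("stage 1 must return at least one intermediate map")
        # The prompt is passed again, so the text conditions both stages.
        return self.maps_to_image(prompt, maps)


def stub_stage_one(prompt: str) -> Maps:
    # Placeholder for a text-conditioned generator of aligned maps.
    return {"depth": f"depth({prompt})", "segmentation": f"seg({prompt})"}


def stub_stage_two(prompt: str, maps: Maps) -> str:
    # Placeholder for a generator conditioned on text plus maps.
    used = "+".join(sorted(maps))
    return f"image[{prompt} | {used}]"


pipeline = TwoStagePipeline(stub_stage_one, stub_stage_two)
result = pipeline.generate("a cat left of a dog")
print(result)  # image[a cat left of a dog | depth+segmentation]
```

Keeping the two stages behind plain callables means either one can be swapped out independently, which mirrors the separation between the two models described in the abstract.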

Paper Structure

This paper contains 7 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: (a) Illustration of the full pipeline. In the first stage, we generate aligned intermediate representation(s) given the input text. In the second stage, we use a pre-trained ControlNet to map the input text and the generated intermediate representation(s) to an output image. (b) Illustration of our alignment procedure. Given two pre-trained text-to-intermediate models (e.g., text-to-depth and text-to-segmentation), we interleave their spatial layers using "temporal" layers. The "temporal" layers consist of either a 3D convolution or a temporal attention layer and indicate the dimension on which the attention or convolution is performed. For clarity, we also provide each component's input and output dimensions. We note that only the temporal layers are trained in this stage.
  • Figure 2: (a) Results using a single intermediate representation (first six columns) and from the original SD (last column). The generated intermediate representation is to the left of each output image. (b) As in (a), but using our aligned intermediate representations (depth & HED or depth & segmentation).
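
The "temporal" attention described in the caption of Figure 1(b) can be illustrated with a small NumPy sketch. It rests on several assumptions that are not taken from the paper: features are laid out as (B, N, C, H, W) with N the number of intermediate representations, attention is single-head, and the result is added back through a residual connection. The function name `temporal_self_attention` and the weight matrices are hypothetical. The sketch shows only the axis handling: spatial positions are folded into the batch, so attention mixes information across the N representations at each pixel and never across pixels.

```python
import numpy as np


def temporal_self_attention(x, w_q, w_k, w_v):
    """Self-attention over the representation axis N of x with shape (B, N, C, H, W)."""
    b, n, c, h, w = x.shape
    # Fold spatial positions into the batch: (B, N, C, H, W) -> (B*H*W, N, C).
    tokens = x.transpose(0, 3, 4, 1, 2).reshape(b * h * w, n, c)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product scores between representations: (B*H*W, N, N).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    mixed = weights @ v  # (B*H*W, N, C)
    # Restore the original layout: (B*H*W, N, C) -> (B, N, C, H, W).
    out = mixed.reshape(b, h, w, n, c).transpose(0, 3, 4, 1, 2)
    # Residual connection (an assumption of this sketch).
    return x + out


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 2, 8, 4, 4))  # e.g. N=2: depth and segmentation features
w_q, w_k, w_v = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
y = temporal_self_attention(x, w_q, w_k, w_v)
print(y.shape)  # (2, 2, 8, 4, 4)
```

With the value projection set to zero, the function returns its input unchanged, so a layer of this form can be inserted between existing spatial layers without altering their output at initialization. Whether the paper uses such an initialization is not stated in the caption.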