
TokenCompose: Text-to-Image Diffusion with Token-level Supervision

Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu

TL;DR

TokenCompose addresses the misalignment between prompts and image content in text-to-image diffusion, especially for prompts with multiple object categories. It introduces token-level and pixel-level grounding losses during finetuning of latent diffusion models, using noun-token segmentation maps generated by grounding models to enforce token-region consistency. The method finetunes Stable Diffusion without adding inference-time modules, yielding stronger multi-category instance composition and improved photorealism, as demonstrated on the new MultiGen benchmark and existing COCO/ADE datasets. The work highlights the benefit of cross-domain grounding signals for open-vocabulary generation and provides a resource for evaluating multi-category compositionality.

Abstract

We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, absent explicit constraint for the consistency between the text prompts and the image contents, leading to unsatisfactory results for composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images. Project link: https://mlpc-ucsd.github.io/TokenCompose

Paper Structure (21 sections, 7 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Given a user-specified text prompt consisting of object compositions that are unlikely to appear simultaneously in a natural scene, our proposed TokenCompose method attains significant performance enhancement over the baseline Latent Diffusion Model (e.g., Stable Diffusion [ldm]) by being able to generate multiple categories of instances from the prompt more accurately.
  • Figure 2: An overview of the TokenCompose training pipeline. Given a training prompt that faithfully describes an image, we adopt a POS tagger [flair] and Grounded SAM [sam, grounding_dino] to extract all binary segmentation maps of the image corresponding to noun tokens from the prompt. Then, we jointly optimize the denoising U-Net of the diffusion model with both its original denoising objective and our grounding objective.
  • Figure 3: Illustration of $\mathcal{L}_\text{token}$ and $\mathcal{L}_\text{pixel}$. We illustrate how $\mathcal{L}_\text{token}$ and $\mathcal{L}_\text{pixel}$ are calculated given a cross-attention map $\mathcal{A}_i$ and a binary segmentation mask $\mathcal{M}_i$. $\mathcal{L}_\text{token}$ aggregates attention activations toward non-masked regions of $\mathcal{M}_i$, and this objective is normalized by the total activations of $\mathcal{A}_i$. However, it does not constrain where activations should be once inside the non-masked region. $\mathcal{L}_\text{pixel}$ gives precise supervision on whether a pixel belongs to the segmented region, constraining where activations should be with binary values. However, it is not normalized by the total activations of $\mathcal{A}_i$. Combining $\mathcal{L}_\text{token}$ and $\mathcal{L}_\text{pixel}$, we take advantage of the benefit of each objective while minimizing their side effects. We show examples of cross-attention activations from models optimized with both $\mathcal{L}_\text{token}$ and $\mathcal{L}_\text{pixel}$, either of them, and neither of them in Figure 4.
  • Figure 4: Impact on cross-attention activations with different objectives. We first demonstrate that finetuning Stable Diffusion with only $\mathcal{L}_\text{LDM}$ does not improve grounding capabilities as much. Adding $\mathcal{L}_\text{pixel}$ alone causes increased cross-attention activations in general. Adding $\mathcal{L}_\text{token}$ plays a vital role in improving token grounding, but leads activations to aggregate in subregions of the targets. By combining $\mathcal{L}_\text{token}$ and $\mathcal{L}_\text{pixel}$, the model shows substantial improvement in grounding text tokens with image features. In this illustration, we apply the null text inversion [null_text_inversion] technique to all models, allowing them to generate the same image for comparable cross-attention maps.
  • Figure 5: Qualitative comparison between baselines and our model. We demonstrate the effectiveness of our training framework in multi-category instance composition compared with a frozen Stable Diffusion model [ldm], Composable Diffusion [composable_diffusion], Structured Diffusion [structured_diffusion], Layout Guidance Diffusion [layout_diffusion], and Attend-and-Excite [attend_and_excite]. The first three columns show compositions of two categories that are difficult for a pretrained Stable Diffusion model to generate (due to rare co-occurrence or significant differences in instance sizes in the real world). The last three columns show compositions of three categories, where composing them requires understanding the visual representation of each text token.
  • ...and 12 more figures
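
The contrast drawn in the Figure 3 caption between $\mathcal{L}_\text{token}$ and $\mathcal{L}_\text{pixel}$ can be made concrete with a small sketch. The captions above do not give the formulas, so the forms below are assumptions chosen to match the description: the token term is taken as the squared fraction of attention mass falling outside the segmented region (normalized by total activation), and the pixel term as a mean per-pixel binary cross-entropy against the mask (not normalized by total activation). The function names `token_loss` and `pixel_loss`, the convention that a mask value of 1 marks the segmented region, and the toy 2x2 maps are all hypothetical.

```python
import math


def token_loss(attn, mask):
    """Assumed form: squared share of attention mass outside the segmented region.

    Normalized by the total activation, so rescaling `attn` leaves it unchanged.
    Assumes the attention map has non-zero total activation.
    """
    total = sum(sum(row) for row in attn)
    inside = sum(
        a
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
        if m
    )
    return (1.0 - inside / total) ** 2


def pixel_loss(attn, mask, eps=1e-7):
    """Assumed form: mean per-pixel binary cross-entropy between attention and mask.

    Not normalized by total activation, so it depends on the attention scale.
    """
    terms = []
    for attn_row, mask_row in zip(attn, mask):
        for a, m in zip(attn_row, mask_row):
            a = min(max(a, eps), 1.0 - eps)  # clamp to keep log() finite
            terms.append(-(m * math.log(a) + (1 - m) * math.log(1.0 - a)))
    return sum(terms) / len(terms)


# Toy example: the left column is the segmented region.
mask = [[1, 0], [1, 0]]

# All attention mass lies inside the region but on a single pixel.
focused = [[0.9, 0.0], [0.0, 0.0]]

# Same spatial pattern at two different scales.
full = [[0.8, 0.2], [0.8, 0.2]]
half = [[0.4, 0.1], [0.4, 0.1]]

print("token, focused:", token_loss(focused, mask))
print("pixel, focused:", pixel_loss(focused, mask))
print("token, full vs half:", token_loss(full, mask), token_loss(half, mask))
print("pixel, full vs half:", pixel_loss(full, mask), pixel_loss(half, mask))
```

Under these assumed forms, the `focused` map incurs no token penalty even though it covers only part of the region, while the pixel term still penalizes the uncovered pixel. Conversely, rescaling the attention map leaves the token term unchanged but alters the pixel term, which mirrors the complementary behavior the caption describes.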