TokenCompose: Text-to-Image Diffusion with Token-level Supervision
Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu
TL;DR
TokenCompose addresses the misalignment between prompts and image content in text-to-image diffusion, especially for prompts with multiple object categories. It introduces token-level and pixel-level grounding losses during finetuning of latent diffusion models, using noun-token segmentation maps generated by grounding models to enforce token-region consistency. The method finetunes Stable Diffusion without adding inference-time modules, yielding stronger multi-category instance composition and improved photorealism, as demonstrated on the new MultiGen benchmark and existing COCO/ADE datasets. The work highlights the benefit of cross-domain grounding signals for open-vocabulary generation and provides a resource for evaluating multi-category compositionality.
Abstract
We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model uses text prompts only as conditions, with no explicit constraint on the consistency between the text prompts and the image contents, which leads to unsatisfactory results when composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images. Project link: https://mlpc-ucsd.github.io/TokenCompose
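To make the idea of a token-wise consistency term concrete, the following is a minimal sketch of what such terms could look like. It is an illustration under stated assumptions, not the formulation confirmed by the text above: it assumes each noun token has a non-negative cross-attention map and a binary segmentation mask of the same size, and the function names `token_loss` and `pixel_loss`, as well as the exact loss forms, are hypothetical.

```python
import math

def token_loss(attn_maps, masks, eps=1e-8):
    """Assumed token-level term: for each noun token, penalize the
    fraction of attention mass that falls outside its mask."""
    per_token = []
    for attn, mask in zip(attn_maps, masks):
        inside = sum(a * m for arow, mrow in zip(attn, mask)
                     for a, m in zip(arow, mrow))
        total = sum(a for arow in attn for a in arow) + eps
        per_token.append((1.0 - inside / total) ** 2)
    return sum(per_token) / len(per_token)

def pixel_loss(attn_maps, masks, eps=1e-8):
    """Assumed pixel-level term: binary cross-entropy between each
    attention value (treated as a probability) and the mask pixel."""
    total, count = 0.0, 0
    for attn, mask in zip(attn_maps, masks):
        for arow, mrow in zip(attn, mask):
            for a, m in zip(arow, mrow):
                a = min(max(a, eps), 1.0 - eps)  # keep log() finite
                total -= m * math.log(a) + (1 - m) * math.log(1.0 - a)
                count += 1
    return total / count

# Toy example: two noun tokens on a 2x4 grid.
# Token 0 should occupy the left half, token 1 the right half.
masks = [
    [[1, 1, 0, 0], [1, 1, 0, 0]],
    [[0, 0, 1, 1], [0, 0, 1, 1]],
]
aligned = masks  # attention exactly on the masks
flipped = [[[1 - m for m in row] for row in mask] for mask in masks]
uniform = [[[0.5] * 4 for _ in range(2)] for _ in range(2)]

loss_aligned = token_loss(aligned, masks)
loss_uniform = token_loss(uniform, masks)
loss_flipped = token_loss(flipped, masks)
```

In this toy setup the token-level term is close to zero when attention sits entirely inside each mask, larger when attention is spread uniformly, and largest when attention lands on the wrong region. In an actual finetuning run such terms would be computed on the model's cross-attention maps and added to the denoising objective; how they are weighted is not specified here.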
