Paragraph-to-Image Generation with Information-Enriched Diffusion Model

Weijia Wu, Zhuang Li, Yefei He, Mike Zheng Shou, Chunhua Shen, Lele Cheng, Yan Li, Tingting Gao, Di Zhang

TL;DR

ParaDiffusion introduces an information-enriched diffusion model for paragraph-to-image generation by coupling a decoder-only LLM (Llama V2) with LoRA-based adaptation and a three-stage training regimen. It builds ParaImage, a long-form caption dataset combining synthetic CogVLM-generated descriptions with a small manually annotated set to enable robust long-text alignment. Empirical results on ViLG-300 and ParaPrompts-400 show significant gains in text faithfulness and visual appeal over state-of-the-art baselines, with ablations validating the value of LLM adaptation, data scale, and high-quality tuning data. The work demonstrates that transferring long-text semantic understanding from LLMs to image generation is feasible and beneficial for complex, multi-object scenes, and the authors state that code and data will be released to foster further research.

Abstract

Text-to-image (T2I) models have recently experienced rapid development, achieving astonishing performance in terms of fidelity and textual alignment capabilities. However, given a long paragraph (up to 512 words), these generation models still struggle to achieve strong alignment and are unable to generate images depicting complex scenes. In this paper, we introduce an information-enriched diffusion model for the paragraph-to-image generation task, termed ParaDiffusion, which delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation. At its core is using a large language model (e.g., Llama V2) to encode long-form text, followed by fine-tuning with LoRA to align the text-image feature spaces in the generation task. To facilitate the training of long-text semantic alignment, we also curated a high-quality paragraph-image pair dataset, namely ParaImage. This dataset contains a small amount of high-quality, meticulously annotated data, and a large-scale synthetic dataset whose long text descriptions are generated by a vision-language model. Experiments demonstrate that ParaDiffusion outperforms state-of-the-art models (SD XL, DeepFloyd IF) on ViLG-300 and ParaPrompts, achieving up to 15% and 45% human voting rate improvements for visual appeal and text faithfulness, respectively. The code and dataset will be released to foster community research on long-text alignment.
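
The abstract mentions LoRA fine-tuning of the text encoder but does not give the rank, scaling, or which layers are adapted. As a rough illustration only, the generic LoRA idea (a frozen weight plus a trainable low-rank update) might be sketched as below. The function names, the `alpha / r` scaling, and the toy matrices are assumptions chosen for illustration, not the authors' implementation.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Hypothetical LoRA-style linear layer: y = W x + (alpha / r) * B (A x).

    W: frozen weight, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)                       # rank of the low-rank update
    base = matvec(W, x)              # output of the frozen layer
    update = matvec(B, matvec(A, x)) # low-rank correction
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy example with d_in = d_out = 2 and rank r = 1 (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
x = [1.0, 2.0]

# With B initialised to zero, the adapted layer equals the frozen layer.
y_init = lora_forward(x, W, A, [[0.0], [0.0]], alpha=2.0)

# After B has moved away from zero, the output shifts by the low-rank term.
y_tuned = lora_forward(x, W, A, [[1.0], [2.0]], alpha=2.0)
```

Only `A` and `B` would be updated during fine-tuning in this formulation, which keeps the number of trainable parameters small relative to the full weight `W`.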

Paper Structure (35 sections, 2 equations, 21 figures, 11 tables)

Figures (21)

  • Figure 1: Examples of Paragraph-Image Alignment from ParaDiffusion. With the powerful semantic understanding capabilities of the LLM, ParaDiffusion is capable of generating highly aesthetic and sophisticated images, aligning well with long textual content.
  • Figure 2: Pipeline of Methodology. The training pipeline of ParaDiffusion mainly includes three stages: 1) Stage-1 pretrains on 0.3 billion samples to acquire general text-image knowledge. 2) Stage-2 employs millions of samples to simultaneously fine-tune the LLM and the diffusion model for Paragraph-Image Alignment. 3) Stage-3 performs quality tuning with curated high-quality annotated data (i.e., ParaImage-Small).
  • Figure 3: Examples of the proposed ParaImage dataset. (a) High-quality images with generated captions (ParaImage-Big) are primarily employed for paragraph-image alignment learning in Stage 2. (b) Aesthetic images with manually written long-form descriptions (ParaImage-Small) are primarily used for quality tuning in Stage 3.
  • Figure 4: Distribution of Caption Length. The textual descriptions of the proposed dataset (ParaImage) far exceed those of currently available public datasets.
  • Figure 5: Distribution of Caption Length for Different Evaluation Datasets. Our ParaPrompts dataset offers a high proportion of long-text descriptions.
  • ...and 16 more figures