An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Zhiyu Tan; Mengping Yang; Luozheng Qin; Hao Yang; Ye Qian; Qiang Zhou; Cheng Zhang; Hao Li

TL;DR

The paper addresses three limitations of CLIP-based text encoders in text-to-image diffusion: English-only support, a 77-token limit, and limited model capacity. It presents OmniDiffusion, which is trained with a three-stage pipeline and attaches a lightweight adapter to an LLM so that the LLM's text representations align with CLIP's embedding space, enabling multilingual and long-context prompts for diffusion-based image synthesis. The stages are multilingual textual alignment (Stage 1), end-to-end text-image training on a 43M text-image dataset (Stage 2), and high-aesthetic fine-tuning on 40K high-quality images (Stage 3). The resulting model is reported to achieve strong FID and CLIP scores, higher aesthetic scores, and favorable human evaluations. The work offers a resource-efficient way to use LLMs as text encoders in diffusion models and a practical baseline for multilingual and long-prompt text-to-image generation.
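To make the Stage 1 idea concrete, here is a minimal sketch of training a small adapter so that LLM features land in a CLIP-like embedding space. It is an illustration under stated assumptions, not the paper's implementation: the adapter is a single linear map, the alignment loss is mean squared error, the features are random toy vectors, and the names `train_adapter`, `D_LLM`, and `D_CLIP` are hypothetical.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def train_adapter(llm_feats, clip_feats, lr=0.05, epochs=200, seed=0):
    """Fit a linear adapter W so that W @ llm_feat approximates clip_feat.

    Assumption: a linear adapter and an MSE loss stand in for whatever
    adapter architecture and alignment objective the paper actually uses.
    Both encoders are treated as frozen; only W is updated.
    """
    rng = random.Random(seed)
    d_in, d_out = len(llm_feats[0]), len(clip_feats[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    history = []  # mean alignment loss per epoch
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(llm_feats, clip_feats):
            pred = matvec(W, x)
            total += mse(pred, y)
            # Gradient of the MSE with respect to each adapter weight.
            for i in range(d_out):
                g = 2.0 * (pred[i] - y[i]) / d_out
                for j in range(d_in):
                    W[i][j] -= lr * g * x[j]
        history.append(total / len(llm_feats))
    return W, history

# Toy data: random "LLM features" and "CLIP targets" related by a hidden
# linear map. Real features would come from the frozen encoders.
rng = random.Random(1)
D_LLM, D_CLIP, N = 8, 4, 16  # hypothetical toy dimensions
true_map = [[rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(D_CLIP)]
llm_feats = [[rng.uniform(-1, 1) for _ in range(D_LLM)] for _ in range(N)]
clip_feats = [matvec(true_map, x) for x in llm_feats]

W, history = train_adapter(llm_feats, clip_feats)
print(f"alignment loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

Only the adapter weights change in this sketch, which reflects why the summary calls the approach resource-efficient: the large pretrained components are reused rather than retrained from scratch.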

Abstract

One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can only encode English, with a maximum length of 77 tokens. Moreover, the capacity of CLIP's text encoder is relatively limited compared to Large Language Models (LLMs), which accept multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve language understanding in text-to-image generation. Unfortunately, training a text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates the existing text-to-image model with LLMs. Specifically, we propose a lightweight adapter that enables fast training of the text-to-image model using the textual representations from LLMs. Extensive experiments demonstrate that our model supports not only multilingual input but also longer input context, with superior image generation quality.
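The practical effect of the 77-token limit can be illustrated with a short sketch. It assumes whitespace splitting as a stand-in for CLIP's real subword tokenizer and ignores special start/end tokens, so the exact cut-off point differs in practice; `truncate_prompt` is a hypothetical helper, not part of any library.

```python
CLIP_MAX_TOKENS = 77  # limit stated in the abstract

def truncate_prompt(prompt, max_tokens=CLIP_MAX_TOKENS):
    """Split a prompt into the tokens an encoder with a fixed context
    window would see and the tokens it would silently drop.

    Assumption: whitespace tokenization approximates the real tokenizer.
    """
    tokens = prompt.split()
    kept = tokens[:max_tokens]
    dropped = tokens[max_tokens:]
    return kept, dropped

# A 100-word prompt: everything past the window never reaches the model.
long_prompt = " ".join(f"word{i}" for i in range(100))
kept, dropped = truncate_prompt(long_prompt)
print(len(kept), "tokens kept;", len(dropped), "tokens dropped")
```

Any description that falls in the dropped tail cannot influence the generated image, which is the motivation for a text encoder with a longer context.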

Paper Structure (17 sections, 9 equations, 15 figures, 9 tables)

Introduction
Related Work
Method
Preliminaries
Framework
Model Training Strategy
Experiments
Experimental Setup
Main Results
Ablation Study and Further Analysis
Conclusion
Appendix
Limitations
Future Works
More Analysis Results
...and 2 more sections

Figures (15)

Figure 1: Our proposed model not only produces images with high visual quality given English input prompts (left), but also understands multilingual prompts for text-to-image generation in various languages (middle) and captures much longer contextual information for generation (right).
Figure 2: Overall framework of our proposed method. The lightweight adapter efficiently connects LLMs and diffusion models, enhancing diffusion models with more powerful language understanding ability.
Figure 3: Three-stage training pipeline. Multilingual textual alignment enables the LLM to connect visual and textual information in the CLIP embedding space. End-to-end text-image training explores the potential of LLM-derived textual features and improves generation quality. High-aesthetic finetuning further improves visual aesthetics.
Figure 4: Qualitative results of OmniDiffusion and competing methods. For models that do not support multilingual text conditions, we translate the given prompts into the corresponding language to generate images. OmniDiffusion produces images with accurate text-image alignment and higher visual quality.
Figure 5: Human evaluation results. Our model is consistently voted as the model that produces images with better visual quality.
...and 10 more figures
