Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

Shihao Zhao; Shaozhe Hao; Bojia Zi; Huaizhe Xu; Kwan-Yee K. Wong

TL;DR

This paper proposes LaVi-Bridge, a pipeline that integrates diverse pre-trained language models and generative vision models for text-to-image generation. It demonstrates that incorporating superior modules, such as more advanced language models or generative vision models, yields notable improvements in capabilities like text alignment or image quality.

Abstract

Text-to-image generation has made significant advancements with the introduction of text-to-image diffusion models. These models typically consist of a language model that interprets user prompts and a vision model that generates corresponding images. As language and vision models continue to progress in their respective domains, there is great potential in replacing the components of text-to-image diffusion models with more advanced counterparts. A broader research objective would therefore be to investigate the integration of any two unrelated language and generative vision models for text-to-image generation. In this paper, we explore this objective and propose LaVi-Bridge, a pipeline that enables the integration of diverse pre-trained language models and generative vision models for text-to-image generation. By leveraging LoRA and adapters, LaVi-Bridge offers a flexible and plug-and-play approach without requiring modifications to the original weights of the language and vision models. Our pipeline is compatible with various language models and generative vision models, accommodating different structures. Within this framework, we demonstrate that incorporating superior modules, such as more advanced language models or generative vision models, results in notable improvements in capabilities like text alignment or image quality. Extensive evaluations have been conducted to verify the effectiveness of LaVi-Bridge. Code is available at https://github.com/ShihaoZhaoZSH/LaVi-Bridge.

Paper Structure (19 sections, 2 equations, 11 figures, 4 tables)

Introduction
Related Work
Language Models and Generative Vision Models
Text-to-Image Diffusion Models
Method
Preliminary
Language and Vision Alignment
Design Details
Experiments
Experimental Settings
Evaluation on Different Language Models
Evaluation on Different Vision Models
Ablation Study
Conclusion
Long Prompts
...and 4 more sections

Figures (11)

Figure 1: Overview of LaVi-Bridge. LaVi-Bridge is capable of integrating various language models and generative vision models. On the left side, we keep the vision model fixed and experiment with different language models in our pipeline. On the right side, we keep the language model fixed and try out different vision models. We display the visualization results alongside each combination.
Figure 2: Pipeline of LaVi-Bridge. We select one model each from the language and vision model pools. We then freeze the pre-trained language and vision models and incorporate LoRA into both models. The connection between the language and vision models is established through an adapter. The only weights we need to train are the ones introduced by LoRA and the adapter.
Figure 3: Visualization results of LaVi-Bridge with different language models. The first row to the fifth row present the results with CLIP text encoder, T5-Small, T5-Base, T5-Large, and Llama-2, respectively. The prompts are displayed at the top or bottom of each column.
Figure 4: Visualization results of LaVi-Bridge under different generative vision models. The first row to the third row present the results with U-Net in Latent Diffusion Model, U-Net in Stable Diffusion V1.4 and transformer in PixArt, respectively. The prompts are displayed at the top or bottom of each column.
Figure 5: User study. The two pie charts on the left display the users' scoring results for different language models, while the two pie charts on the right display the users' scoring results for different generative vision models. Each percentage represents the proportion of the total score, summed over all models, that a given model obtained.
...and 6 more figures
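
The pipeline described in the Figure 2 caption (freeze both pre-trained models, inject LoRA into each, and connect them through a trainable adapter) can be illustrated with a toy sketch. This is not the authors' implementation: the class names LoRALinear and Adapter, the tiny dimensions, the two-layer ReLU adapter, and the zero initialization of the LoRA matrix B are all assumptions made for illustration, and each "model" is reduced to a single linear layer written in plain Python.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix used for toy weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""
    def __init__(self, W, rank, alpha, rng):
        out_dim, in_dim = len(W), len(W[0])
        self.W = W                                        # frozen, never updated
        self.A = rand_matrix(rank, in_dim, rng)           # trainable
        self.B = [[0.0] * rank for _ in range(out_dim)]   # trainable, zero init (assumption)
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)

class Adapter:
    """Trainable feed-forward map from the text feature width to the vision width."""
    def __init__(self, text_dim, hidden_dim, vision_dim, rng):
        self.W1 = rand_matrix(hidden_dim, text_dim, rng)
        self.W2 = rand_matrix(vision_dim, hidden_dim, rng)

    def forward(self, x):
        hidden = [max(0.0, h) for h in matvec(self.W1, x)]  # ReLU
        return matvec(self.W2, hidden)

    def num_trainable(self):
        return sum(len(r) for r in self.W1) + sum(len(r) for r in self.W2)

rng = random.Random(0)
TEXT_DIM, VISION_DIM, RANK = 8, 6, 2  # hypothetical toy sizes

# Stand-ins for one layer of the language model and one layer of the vision model.
text_layer = LoRALinear(rand_matrix(TEXT_DIM, TEXT_DIM, rng), RANK, alpha=2.0, rng=rng)
vision_layer = LoRALinear(rand_matrix(VISION_DIM, VISION_DIM, rng), RANK, alpha=2.0, rng=rng)
adapter = Adapter(TEXT_DIM, 16, VISION_DIM, rng)

token = [rng.uniform(-1, 1) for _ in range(TEXT_DIM)]
text_feature = text_layer.forward(token)    # language side: frozen W + LoRA
condition = adapter.forward(text_feature)   # bridge: text width -> vision width
vision_out = vision_layer.forward(condition)  # vision side: frozen W + LoRA

trainable = text_layer.num_trainable() + vision_layer.num_trainable() + adapter.num_trainable()
print("trainable parameters (LoRA + adapter):", trainable)
```

Because B starts at zero in this sketch, each LoRA-wrapped layer initially reproduces the frozen layer's output, so training begins from the behavior of the original models; only A, B, and the adapter weights would receive gradient updates.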
