LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Mushui Liu, Yuhang Ma, Yang Zhen, Jun Dan, Yunlong Yu, Zeng Zhao, Zhipeng Hu, Bai Liu, Changjie Fan

TL;DR

LLM4GEN addresses the difficulty diffusion-based text-to-image models have with complex prompts by fusing LLM semantic representations into the existing text conditioning through a Cross-Adapter Module (CAM). It adds an entity-guided regularization loss to better align entity-attribute relationships, and introduces DensePrompts, a benchmark for evaluating long, compositional prompts. The approach is plug-and-play with diffusion backbones such as SD1.5 and SDXL, and it reduces training data needs while improving image-text alignment and sample quality. Results on MSCOCO, T2I-CompBench, and DensePrompts show gains in color, texture, and semantic fidelity, along with favorable human evaluations and improved efficiency.

Abstract

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In this paper, we propose a novel framework called LLM4GEN, which enhances the semantic understanding of text-to-image diffusion models by leveraging the representation of Large Language Models (LLMs). It can be seamlessly incorporated into various diffusion models as a plug-and-play component. A specially designed Cross-Adapter Module (CAM) integrates the original text features of text-to-image models with LLM features, thereby enhancing text-to-image generation. Additionally, to facilitate and correct entity-attribute relationships in text prompts, we develop an entity-guided regularization loss to further improve generation performance. We also introduce DensePrompts, which contains 7,000 dense prompts to provide a comprehensive evaluation for the text-to-image generation task. Experiments indicate that LLM4GEN significantly improves the semantic alignment of SD1.5 and SDXL, demonstrating increases of 9.69% and 12.90% in color on T2I-CompBench, respectively. Moreover, it surpasses existing models in terms of sample quality, image-text alignment, and human evaluation.
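
The abstract says only that CAM integrates the original text features with LLM features; it does not say how. As a rough illustration, the sketch below fuses two feature sequences with residual cross-attention, a common choice for this kind of fusion. The function name `cross_attention_fuse`, the absence of learned projections, and the residual addition are all assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(text_feats, llm_feats):
    """Hypothetical fusion step (an assumption, not the paper's CAM).

    text_feats: n vectors of dimension d, used as queries.
    llm_feats:  m vectors of dimension d, used as keys and values.
    Returns n fused vectors: each query plus its attention-weighted
    average of the LLM vectors (a residual connection).
    """
    d = len(text_feats[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in text_feats:
        # Scaled dot-product score between this query and every LLM token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in llm_feats]
        weights = softmax(scores)
        # Attention-weighted average of the LLM value vectors.
        attended = [
            sum(w * v[j] for w, v in zip(weights, llm_feats)) for j in range(d)
        ]
        # Residual add keeps the original text feature in the output.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for text-encoder tokens
    llm = [[2.0, 0.0], [0.0, 2.0]]    # stand-in for LLM tokens
    for row in cross_attention_fuse(text, llm):
        print(row)
```

A real adapter would add learned query, key, and value projections and would have to reconcile the different hidden sizes of the two encoders; both are omitted here to keep the sketch self-contained.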

Paper Structure (21 sections, 6 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Architecture comparison between (a) LLM-guidance models, (b) LLM-alignment models, and (c) our proposed LLM4GEN.
  • Figure 2: Image generation using concise and dense prompts, with colored text highlighting key entities or attributes (zoom in for details).
  • Figure 3: Overview of LLM4GEN. (a) Framework. (b) Cross-Adapter Module.
  • Figure 4: Statistics of the DensePrompts benchmark compared with other benchmarks.
  • Figure 5: Aesthetic Score and CLIP Score (%) on DensePrompts benchmark.
  • ...and 4 more figures