Decoder-Only LLMs are Better Controllers for Diffusion Models

Ziyi Dong, Yao Xiao, Pengxu Wei, Liang Lin

TL;DR

The paper tackles the trial-and-error prompting that diffusion-based text-to-image generation often requires, which the authors attribute to the limited semantic understanding of encoder-based text conditioning. It introduces LLMDiff-Adapter, a plug-in module that uses decoder-only LLMs as diffusion controllers by adding a second cross-attention path and fusing it, through trainable weights, with the cross-attention of the pre-trained diffusion model. The authors provide a theoretical analysis that treats LLM blocks as diffusion steps and outline how to extract useful text encodings from decoder-only LLMs through Langevin dynamics and score-based methods. Empirically, LLMDiff demonstrates improved text-image alignment, richer details, and stronger reasoning on a 1M-scale dataset, outperforming encoder-based baselines on multiple metrics while remaining compatible with pre-trained diffusion models.

Abstract

Groundbreaking advancements in text-to-image generation have recently been achieved with the emergence of diffusion models. These models exhibit a remarkable ability to generate highly artistic and intricately detailed images based on textual prompts. However, obtaining the desired generation outcomes often necessitates repeated trials of manipulating text prompts, just like casting spells on a magic mirror, and the reason behind this is the limited capability of semantic understanding inherent in current image generation models. Specifically, existing diffusion models encode the text prompt input with a pre-trained encoder structure, which is usually trained on a limited number of image-caption pairs. The state-of-the-art large language models (LLMs) based on the decoder-only structure have shown a powerful semantic understanding capability, as their architectures are more suitable for training on very large-scale unlabeled data. In this work, we propose to enhance text-to-image diffusion models by borrowing the strength of semantic understanding from large language models, and devise a simple yet effective adapter to allow the diffusion models to be compatible with the decoder-only structure. Meanwhile, we also provide a supporting theoretical analysis with various architectures (e.g., encoder-only, encoder-decoder, and decoder-only), and conduct extensive empirical evaluations to verify its effectiveness. The experimental results show that the enhanced models with our adapter module are superior to the state-of-the-art models in terms of text-to-image generation quality and reliability.

Paper Structure

This paper contains 20 sections, 11 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Comparison with other neural network structures employed for computing text encoding in diffusion models. Our proposed LLMDiff, which leverages a decoder-only structure by casting the transformer-based language model as a diffusion model, can predict the text encodings for text-to-image generation by integrating layer-wise representations in the language model. Intuitively, compared with other structures (e.g., encoder-decoder), our LLMDiff is more powerful in exploring the semantic meanings and dependencies among words in the input text prompt. More details and theoretical derivations are provided in the paper's section on decoder encodings.
  • Figure 2: Our LLMDiff-Adapter framework, wherein the parameters of both the LLM and the diffusion U-Net (including the original cross-attention module) are frozen during training. The newly added cross-attention module is combined with the original one through two adaptive weight parameters, which are adjusted dynamically during training.
  • Figure 3: In comparison with existing approaches, LLMDiff exhibits superior capabilities in both language comprehension and action understanding. Furthermore, it is proficient in generating images with high-quality details.
  • Figure 4: Model evaluation on the capability of causal and logical reasoning for text-to-image generation.
  • Figure 5: The scale factor of newly added attentions and the original attentions in each cross-attention module of U-Net.
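
The fusion described in the captions for Figures 2 and 5 can be illustrated with a small sketch. It is not the authors' implementation. It assumes the two adaptive weights act as scalar scale factors in a weighted sum of the two cross-attention outputs, and it omits the learned query/key/value projections, multi-head splitting, and any projection from the LLM hidden size to the U-Net width. The names `cross_attention`, `fused_cross_attention`, `w_orig`, and `w_new` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dim vectors (stand-ins for image tokens).
    context: list of d-dim vectors (stand-ins for text-token encodings).
    Learned projections are omitted, so keys and values are the context itself.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)])
    return out


def fused_cross_attention(queries, encoder_ctx, llm_ctx, w_orig, w_new):
    """Weighted sum of the original path and the added LLM path.

    ASSUMPTION: the two adaptive weights are scalars applied to each path's
    output; the paper's exact fusion rule may differ.
    """
    orig = cross_attention(queries, encoder_ctx)  # frozen, pre-trained path
    new = cross_attention(queries, llm_ctx)       # newly added path
    return [[w_orig * a + w_new * b for a, b in zip(row_o, row_n)]
            for row_o, row_n in zip(orig, new)]


# Toy inputs: two image tokens, two encoder text tokens, one LLM text token.
queries = [[1.0, 0.0], [0.0, 1.0]]
encoder_ctx = [[1.0, 0.0], [0.0, 1.0]]
llm_ctx = [[0.5, 0.5]]

fused = fused_cross_attention(queries, encoder_ctx, llm_ctx,
                              w_orig=1.0, w_new=0.5)
```

Under this assumption, setting `w_new` to zero reproduces the output of the original cross-attention path, which is one way an adapter of this kind could stay compatible with a frozen pre-trained diffusion model. In training, only the added path and the two weights would receive gradients.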