Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models

Byeonghu Na, Minsang Park, Gyuwon Sim, Donghyeok Shin, HeeSun Bae, Mina Kang, Se Jung Kwon, Wanmo Kang, Il-Chul Moon

TL;DR

Diffusion Adaptive Text Embedding is proposed, which dynamically updates text embeddings at each diffusion timestep based on intermediate perturbed data, and maintains the generative capability of the model while providing superior text-image alignment over fixed text embeddings across various tasks.

Abstract

Text-to-image diffusion models rely on text embeddings from a pre-trained text encoder, but these embeddings remain fixed across all diffusion timesteps, limiting their adaptability to the generative process. We propose Diffusion Adaptive Text Embedding (DATE), which dynamically updates text embeddings at each diffusion timestep based on intermediate perturbed data. We formulate an optimization problem and derive an update rule that refines the text embeddings at each sampling step to improve alignment and preference between the mean predicted image and the text. This allows DATE to dynamically adapt the text conditions to the reverse-diffused images throughout diffusion sampling without requiring additional model training. Through theoretical analysis and empirical results, we show that DATE maintains the generative capability of the model while providing superior text-image alignment over fixed text embeddings across various tasks, including multi-concept generation and text-guided image editing. Our code is available at https://github.com/aailab-kaist/DATE.
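
The per-step refinement described above can be illustrated with a toy sketch. It is not the authors' implementation: the linear "mean predicted image" `predict_x0`, the quadratic evaluation function `evaluate`, the matrix `W`, and the step size `rho` are all hypothetical stand-ins chosen so the example runs without a diffusion model. The only element taken from the paper is the shape of the update: the text embedding is moved along the normalized gradient of an evaluation function of the mean predicted image.

```python
import math

def predict_x0(x_t, c, W):
    """Toy stand-in for the mean predicted image: x_t + W @ c (hypothetical)."""
    return [x + sum(w * cj for w, cj in zip(row, c)) for x, row in zip(x_t, W)]

def evaluate(x0, y):
    """Toy text-conditioned evaluation function h(x0; y); higher is better."""
    return -sum((a - b) ** 2 for a, b in zip(x0, y))

def grad_wrt_embedding(x_t, c, W, y):
    """Analytic gradient of evaluate(predict_x0(x_t, c, W), y) w.r.t. c."""
    x0 = predict_x0(x_t, c, W)
    residual = [a - b for a, b in zip(x0, y)]
    # d h / d c_j = -2 * sum_i W[i][j] * residual[i]
    return [-2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(c))]

def update_embedding(c, grad, rho):
    """One normalized-gradient ascent step: c + rho * grad / ||grad||."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(c)  # already at a stationary point; leave c unchanged
    return [cj + rho * g / norm for cj, g in zip(c, grad)]

# Hypothetical toy values: a 2-D "image", a 3-D "text embedding".
x_t = [0.5, -1.0]
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
c = [0.0, 0.0, 0.0]
y = [1.0, 1.0]
rho = 0.1  # assumed step size, not a value from the paper

before = evaluate(predict_x0(x_t, c, W), y)
c_new = update_embedding(c, grad_wrt_embedding(x_t, c, W, y), rho)
after = evaluate(predict_x0(x_t, c_new, W), y)
print(before, after)
```

In this toy setting the evaluation score rises after one step, and the embedding moves by exactly `rho` in Euclidean norm, which is the practical effect of normalizing the gradient: the step length is controlled by `rho` rather than by the gradient's magnitude. In an actual sampler the gradient would presumably come from automatic differentiation through the denoising network and the evaluation model rather than from a closed form.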

Paper Structure

This paper contains 41 sections, 7 theorems, 34 equations, 22 figures, 13 tables, 1 algorithm.

Key Result

Proposition 0

Let $h({\mathbf{c}}_{1},\cdots,{\mathbf{c}}_{T}) \coloneqq \mathbb{E}_{{\mathbf{x}}_{0:T}} [h({\mathbf{x}}_0;y)]$ where ${\mathbf{x}}_{0:T-1} \sim \prod_{\tau=1}^{T} p_{\bm{\theta}}({\mathbf{x}}_{\tau-1}|{\mathbf{x}}_{\tau},{\mathbf{c}}_\tau)$, ${\mathbf{x}}_T \sim p_T$, and $\mathcal{C}_t \coloneq

Figures (22)

  • Figure 1: (a) Overview of the conventional fixed text embedding and the proposed adaptive text embedding during the text-to-image diffusion sampling process. Green shapes represent the diffusion model network, orange shapes represent the text encoder, and gray boxes labeled Opt. indicate our text embedding optimization, detailed in Figure 3. (b) ImageReward (Xu et al., 2023), a text-to-image generation metric, for mean predicted images. Red triangles mark the timesteps where text embedding is updated.
  • Figure 2: Examples of text-conditioned evaluation functions.
  • Figure 2: Results on COCO using SD v1.5 with various evaluation functions. Bold values indicate the best performance, while italic values denote cases that underperform the fixed embedding.
  • Figure 3: Update step of DATE at timestep $t$. The symbol with an inverted triangle inside a circle represents the normalized gradient with respect to ${\mathbf{c}}$, and $\oplus$ denotes summation.
  • Figure 3: Results on COCO across backbones.
  • ...and 17 more figures

Theorems & Definitions (12)

  • Proposition 0
  • Theorem 1
  • Proposition 1
  • proof
  • Theorem 1
  • proof
  • Proposition 1
  • proof
  • Lemma 2
  • proof : Proof of Lemma
  • ...and 2 more