Contextualized Diffusion Models for Text-Guided Image and Video Generation

Ling Yang; Zhilong Zhang; Zhaochen Yu; Jingwei Liu; Minkai Xu; Stefano Ermon; Bin Cui

TL;DR

ContextDiff addresses the mismatch between forward and reverse conditioning in text-guided diffusion models by injecting cross-modal context into the full diffusion trajectory. It generalizes the contextualized trajectory adapter to DDPMs and DDIMs, aligning forward and reverse processes to better convey textual semantics. Empirically, ContextDiff achieves new state-of-the-art results on text-to-image generation and text-to-video editing, with improvements in semantic alignment and temporal consistency. The work provides code and demonstrates the practical impact of cross-modal trajectory conditioning for multimodal synthesis.

Abstract

Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We generalize our contextualized diffusion to both DDPMs and DDIMs with theoretical derivations, and demonstrate the effectiveness of our model in evaluations with two challenging tasks: text-to-image generation, and text-to-video editing. In each task, our ContextDiff achieves new state-of-the-art performance, significantly enhancing the semantic alignment between text condition and generated samples, as evidenced by quantitative and qualitative evaluations. Our code is available at https://github.com/YangLing0818/ContextDiff
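The idea of carrying cross-modal context into the forward process can be illustrated with a small sketch. The code below is not taken from the ContextDiff repository. It assumes a linear noise schedule, a toy stand-in `toy_context` for a learned text-visual adapter, and a hypothetical shift coefficient `k_t` chosen only so that the shift vanishes at t = 0. With `shift_scale=0` it reduces to the standard DDPM forward marginal; with `shift_scale>0` the mean of x_t is additionally moved by a context-dependent term.

```python
import math
import random


def alpha_bar(t, T, beta_min=1e-4, beta_max=0.02):
    """Cumulative product of (1 - beta_s), s = 1..t, for an assumed linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta_s = beta_min + (beta_max - beta_min) * (s - 1) / max(T - 1, 1)
        prod *= 1.0 - beta_s
    return prod


def toy_context(x0, c):
    """Hypothetical stand-in for a learned cross-modal adapter: points from x0 toward c."""
    return [ci - xi for xi, ci in zip(x0, c)]


def forward_sample(x0, c, t, T, shift_scale=0.0, rng=None):
    """Draw x_t given x_0 and a condition vector c.

    shift_scale = 0 gives the standard DDPM marginal
    N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I);
    shift_scale > 0 adds a context-dependent shift k_t * toy_context(x0, c) to the mean.
    """
    rng = rng or random.Random(0)  # fixed seed so repeated calls share the same noise
    a_bar = alpha_bar(t, T)
    # Assumed coefficient (illustrative only): zero at t = 0, where a_bar = 1.
    k_t = shift_scale * math.sqrt(a_bar) * (1.0 - math.sqrt(a_bar))
    shift = toy_context(x0, c)
    return [
        math.sqrt(a_bar) * xi + k_t * si + math.sqrt(1.0 - a_bar) * rng.gauss(0.0, 1.0)
        for xi, si in zip(x0, shift)
    ]


x0 = [0.5, -1.0, 2.0]   # toy "visual sample"
c = [1.0, 1.0, 1.0]     # toy "text embedding" of the same dimension
plain = forward_sample(x0, c, t=50, T=100)
contextual = forward_sample(x0, c, t=50, T=100, shift_scale=1.0)
```

Because both calls use the same default seed, `plain` and `contextual` differ only by the context shift, which makes the effect of the extra mean term easy to inspect in isolation.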

Paper Structure (37 sections, 6 theorems, 37 equations, 15 figures, 8 tables)

Introduction
Related Work
Text-Guided Visual Diffusion Models
Diffusion Trajectory Optimization
Method
Cross-Modal Contextualized Diffusion
Adapting Reverse Process
Simplified Training Objective
Context-Aware Sampling
Generalizing Contextualized Diffusion to DDIMs
Experiments
Text-to-Image Generation
Datasets and Metrics.
Implementation Details.
Quantitative and Qualitative Results
...and 22 more sections

Key Result

Lemma 1

For the forward process $q({\bm{x}}_1,{\bm{x}}_2,\dots,{\bm{x}}_T \mid {\bm{x}}_0,{\bm{c}}) = \prod_{t=1}^T q({\bm{x}}_t \mid {\bm{x}}_{t-1},{\bm{x}}_0,{\bm{c}})$, if the transition kernel $q({\bm{x}}_t \mid {\bm{x}}_{t-1},{\bm{x}}_0,{\bm{c}})$ is defined as in eq-ddpm-forward2, then the conditional distribution …

Figures (15)

Figure 1: A simplified illustration of text-guided visual diffusion models with (a) conventional forward and reverse diffusion processes and (b) our contextualized forward and reverse diffusion processes. $\tilde{x}_0$ denotes the denoising network's estimate of the visual sample at each timestep.
Figure 2: Qualitative comparison in text-to-video editing; the edited text prompt is denoted in color. ContextDiff achieves the best semantic alignment, image fidelity, and editing quality.
Figure 3: Illustration of our ContextDiff.
Figure 4: Synthesis examples demonstrating text-to-image capabilities for various text prompts with LDM, Imagen, and ContextDiff (Ours). Our model better expresses the semantics of the text marked in blue. Red boxes highlight critical fine-grained parts where LDM and Imagen fail to align with the text. For example, in the second row, only our method successfully generates the four letters spelling "LOVE". In the third row, our method generates the specific detail of a film roll, while the other methods lose this detail.
Figure 5: Generalizing our context-aware adapter to Tune-A-Video (Wu et al., 2022).
...and 10 more figures

Theorems & Definitions (11)

Lemma 1
Proof 1
Proposition 1
Proof 2
Lemma 2
Proof 3
Lemma 3
Proof 4
Lemma 4
Proof 5
...and 1 more
