Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation

Ruoyu Wang; Yongqi Yang; Zhihao Qian; Ye Zhu; Yu Wu

TL;DR

This work investigates the properties of diffusion (physics) in diffusion (machine learning) and proposes Cyclic One-Way Diffusion (COW), a method that controls the direction of the diffusion phenomenon in a frozen pre-trained diffusion model. COW targets versatile customization scenarios in which low-level pixel information from the conditioning must be preserved.

Abstract

Originating from the diffusion phenomenon in physics that describes particle movement, diffusion generative models inherit the characteristics of a stochastic random walk in the data space along the denoising trajectory. However, the intrinsic mutual interference among image regions conflicts with the needs of practical downstream applications in which low-level pixel information from the given conditioning should be preserved (e.g., customization tasks such as personalized generation and inpainting based on a single user-provided image). In this work, we investigate the properties of diffusion (physics) in diffusion (machine learning) and propose Cyclic One-Way Diffusion (COW), a method that controls the direction of the diffusion phenomenon in a frozen pre-trained diffusion model for versatile customization scenarios. Notably, unlike most current methods, which incorporate additional conditions by fine-tuning the base text-to-image diffusion model or by learning auxiliary networks, our method offers a new perspective on what the task requires and applies to a wider range of customization scenarios in a learning-free manner. Extensive experimental results show that COW achieves more flexible customization under strict visual conditions in different application settings. Project page: https://wangruoyu02.github.io/cow.github.io/.

Paper Structure (23 sections, 3 equations, 18 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Method
Preliminaries
Diffusion in Diffusion
Training-free Cyclic One-way Diffusion
Experiments
Experiments Setup
Experimental Results
Conclusion and Discussion
Appendix
TV2I Generation Results Comparison and Analysis
Degree of Changes on Visual Conditions
Multiple Visual Condition Results
Different Sizes of Visual Condition
...and 8 more sections

Figures (18)

Figure 1: Comparison with existing SOTA methods for maintaining the fidelity of text and visual conditions in different application scenarios. We consistently achieve superior fidelity to both text and visual conditions across all three settings. In contrast, other learning-based approaches struggle to attain the same level of performance across diverse scenarios.
Figure 2: Illustration of "diffusion in diffusion". We invert pure gray and pure white images back to $\mathbf{x}_t$, merge them in different layouts, and then regenerate them back to $\mathbf{x}_0$ via deterministic denoising. Different columns indicate different replacement steps $t$. The resulting images show how regions within an image diffuse and interfere with each other during denoising.
Figure 3: The pipeline of our proposed COW method. Given the input visual condition, we paste it onto a predefined background and invert it to obtain the seed initialization for the starting point of the cycle. In the Cyclic One-Way Diffusion process, we "disturb" and "reconstruct" the image in a cyclic way and ensure a continuous one-way diffusion by consistently replacing the conditioned region with the corresponding $\mathbf{x}_t$.
Figure 4: The adaptation of the visual condition to align with the text condition while maintaining the semantic and pixel-level information of the visual condition. In each pair of images, the smaller image is the given visual condition and the other is the generated result. The bolded parts of the text conditions highlight the conflicts between conditions.
Figure 5: Analysis of the cycling process that diffuses "visual seed" to its surroundings. The leftmost figure shows a given face condition. The right shows the images generated with given text conditions. The cycle number increases from the left to the right.
...and 13 more figures
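
The "disturb", "reconstruct", and "replace" loop described in the Figure 3 caption can be illustrated with a small toy sketch. This is not the authors' implementation. It assumes a hypothetical stand-in denoiser (`toy_denoise`, a neighbour average that lets regions interfere) in place of the frozen pre-trained model, and it uses the standard forward-noising formula $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ in place of the inversion used in the paper. All function names, the `alpha_bars` schedule, and the cycle count are illustrative assumptions.

```python
import numpy as np


def forward_noise(x0, alpha_bar, rng):
    """Noise a clean image x0 to the level given by alpha_bar (standard forward formula)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps


def paste(x_t, cond_t, mask):
    """Replace the masked region of x_t with the visual condition at the same noise level."""
    return np.where(mask, cond_t, x_t)


def toy_denoise(x_t):
    """Hypothetical stand-in for a frozen model: average each pixel with its 4 neighbours."""
    return (x_t
            + np.roll(x_t, 1, 0) + np.roll(x_t, -1, 0)
            + np.roll(x_t, 1, 1) + np.roll(x_t, -1, 1)) / 5.0


def cyclic_one_way(cond, mask, alpha_bars, n_cycles, seed=0):
    """Toy cyclic loop. alpha_bars must be increasing (noisiest level first)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(cond.shape)  # start from noise
    for cycle in range(n_cycles):
        # "Reconstruct": walk from the noisiest to the cleanest level,
        # re-planting the condition before every denoising step so that
        # information flows from the condition outward, not the reverse.
        for a in alpha_bars:
            x = paste(x, forward_noise(cond, a, rng), mask)
            x = toy_denoise(x)
        if cycle < n_cycles - 1:
            # "Disturb": jump back from the cleanest to the noisiest level.
            ratio = alpha_bars[0] / alpha_bars[-1]
            x = np.sqrt(ratio) * x + np.sqrt(1.0 - ratio) * rng.standard_normal(x.shape)
    # Finish with the clean condition in place.
    return paste(x, cond, mask)
```

In this sketch the masked region is overwritten at every step, so the surroundings can adapt to the condition while the condition itself is never altered by them. That is the one-way property the caption describes, shown here only on a toy denoiser.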
