Non-Cross Diffusion for Semantic Consistency

Ziyang Zheng; Ruiyuan Gao; Qiang Xu

TL;DR

Non-Cross Diffusion is a generative-modeling approach for learning ordinary differential equation (ODE) models. It raises the dimensionality of the network input so that points sampled from two distributions are connected by paths that do not cross.

Abstract

In diffusion models, deviations from a straight generative flow are a common issue, resulting in semantic inconsistencies and suboptimal generations. To address this challenge, we introduce 'Non-Cross Diffusion', an approach in generative modeling for learning ordinary differential equation (ODE) models. Our method increases the dimensionality of the network input so that points sampled from two distributions are connected by uncrossed paths. This design improves semantic consistency throughout the inference process, which is especially critical for applications that rely on consistent generative flows, such as distillation methods and the deterministic sampling that underlies image editing and interpolation. Our empirical results demonstrate the effectiveness of Non-Cross Diffusion, showing a substantial reduction in semantic inconsistencies across different numbers of inference steps and a notable improvement in the overall performance of diffusion models.

Paper Structure (15 sections, 12 equations, 7 figures, 4 tables)

Introduction
Related work
Diffusion models
Conditional Image Generation
Method
Preliminary
Understanding Drawbacks of DDPM Flow
Non-Cross Diffusion
Inference Flow Consistency
Experiment
Toy Examples
Experiments on Image Generation
Ablation Study
Discussion
Conclusion

Figures (7)

Figure 1: Illustrating xFlow in Diffusion Models. (a) Demonstrates the ambiguity in training targets caused by crossing flows, leading to the xFlow problem. (b) Shows how our method eliminates flow crossing by increasing the dimensionality of network inputs, thus resolving the xFlow problem. (c) Depicts how xFlow leads to variable sampling results across different steps, undermining deterministic sampling even for Stable Diffusion LDM. (d) Top: Highlights the discrepancies between outcomes from reduced steps sampling (blue) versus standard results (from 1000 steps in red) due to xFlow. Bottom: Our method ensures consistent outputs across different sampling steps. (e) Top: Exhibits instances where xFlow causes Out-Of-Distribution (OOD) outcomes in reduced steps sampling (blue) compared to standard results (from 1000 steps in red). Bottom: Our approach minimizes the occurrence of OOD samples.
Figure 2: The overview of non-cross diffusion. Training stage: The training phase involves two cases. In Case 1, we utilize $\mathbf{0}$ as the condition and calculate loss function $L_{simple}$ as defined in Eq. \ref{loss}. For Case 2, we first compute $\hat{\epsilon}$ using $\mathbf{0}$ as condition. Subsequently, $\hat{\epsilon}$ is employed as the condition to calculate $L_{simple}$. Throughout the training process, Case 1 is applied with a fixed probability $p$; otherwise, Case 2 is implemented. Inference stage: During the inference phase, $\mathbf{0}$ is used as the condition in the initial denoising step. This is followed by iterative utilization of the estimated noise from the previous step as the condition for subsequent steps.
Figure 3: Generated images and inference flows of Stable Diffusion 1.5 using the DDIM scheduler with prompt "a photo of an astronaut riding a horse on mars" and negative prompt "bad, deformed, ugly, bad anotomy". (a)-(d) are generated with seeds 0, 1, 2, and 3, respectively. The results show that inference flows run with different numbers of steps can diverge considerably at a given time $t$, which indicates the influence of xFlow.
Figure 4: Results of the Toy Model. (a) Comparison of Generated Distributions: This panel illustrates the distributions generated by the baseline model and our proposed model. As the number of inference steps decreases, the baseline model tends to produce a significant number of out-of-distribution (OOD) samples. In contrast, our model effectively mitigates the generation of OOD samples. (b) Trajectory Analysis: This panel compares the generated trajectories of the baseline and our models. The baseline model's inference flow often redirects at the intersection point, leading to a target OOD distribution as the inference steps decrease. Our method, however, maintains a consistent direction in the inference model, thereby straightening the trajectory.
Figure 5: Images generated using DDIM with inference steps in {5, 10, 25, 50, 100, 200, 500, 1000} on CIFAR-10. For the baseline method, the semantic content of an image generated with few inference steps can differ greatly from that of the same image generated with many steps, which implies that the inference flow changes direction at some timesteps.
...and 2 more figures
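
The two training cases and the inference loop described in the Figure 2 caption can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. The names `training_loss`, `sample`, `model(x_t, t, cond)`, and `step(x, eps_hat, t)` are hypothetical, and the mean squared error between predicted and true noise is assumed here as a stand-in for $L_{simple}$, whose definition is not reproduced on this page. The caption does not say whether gradients flow through $\hat{\epsilon}$ in Case 2; the sketch has no automatic differentiation, so it takes no position on that.

```python
import random


def zeros_like(x):
    """Zero vector with the same length as x (the 0 condition)."""
    return [0.0] * len(x)


def mse(a, b):
    """Mean squared error, assumed here as a stand-in for L_simple."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def training_loss(model, x_t, t, eps, p, rng=random):
    """Loss for one training example, following the two cases in the caption.

    model(x_t, t, cond) is a hypothetical noise predictor that takes the
    noisy sample, the timestep, and an extra condition of the same shape.
    """
    zero = zeros_like(x_t)
    if rng.random() < p:
        # Case 1 (probability p): condition on the zero vector.
        cond = zero
    else:
        # Case 2: first estimate the noise with the zero condition,
        # then use that estimate as the condition.
        cond = model(x_t, t, zero)
    eps_pred = model(x_t, t, cond)
    return mse(eps_pred, eps)


def sample(model, x_T, timesteps, step):
    """Inference: zero condition first, then feed back the previous estimate.

    step(x, eps_hat, t) is a hypothetical deterministic update rule
    (for example a DDIM-style step) supplied by the caller.
    """
    x = list(x_T)
    cond = zeros_like(x)
    for t in timesteps:
        eps_hat = model(x, t, cond)
        x = step(x, eps_hat, t)
        cond = eps_hat  # the next step is conditioned on this estimate
    return x
```

Feeding the previous noise estimate back as an extra input is what the caption describes as the added condition: the network sees both the current sample and the direction it was already following.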
