E2ED$^2$: Direct Mapping from Noise to Data for Enhanced Diffusion Models

Zhiyu Tan, WenXu Qian, Hesen Chen, Mengping Yang, Lei Chen, Hao Li

TL;DR

Diffusion models face three core issues: a training-inference gap, information leakage in the forward noising process, and difficulty integrating perceptual/adversarial losses. The authors propose E2ED$^2$, an end-to-end framework that directly maps Gaussian noise to the data distribution, aligning training with sampling and enabling joint optimization with perceptual and GAN losses. On COCO30K and HW30K, E2ED$^2$ achieves improved FID and CLIP scores while using few sampling steps ($T<4$) and a compact latent-diffusion backbone, outperforming several state-of-the-art baselines. This approach potentially unifies diffusion stability with GAN-like discriminative optimization, offering a more robust and efficient path for high-quality text-to-image generation.

Abstract

Diffusion models have established themselves as the de facto paradigm in visual generative modeling, with remarkable success across diverse applications ranging from high-quality image synthesis to temporally aware video generation. Despite these advances, three fundamental limitations persist: 1) a discrepancy between the training and inference processes, 2) progressive information leakage throughout the noise corruption procedure, and 3) inherent constraints that prevent effective integration of modern optimization criteria such as perceptual and adversarial losses. To mitigate these challenges, we present a novel end-to-end learning paradigm that establishes direct optimization from the final generated samples back to the initial noise. Our proposed End-to-End Differentiable Diffusion, dubbed E2ED$^2$, introduces several key improvements: it eliminates the sequential training-sampling mismatch and intermediate information leakage by framing training as a direct transformation from isotropic Gaussian noise to the target data distribution. Additionally, this training framework enables seamless incorporation of adversarial and perceptual losses into the core optimization objective. Comprehensive evaluation on standard benchmarks, including COCO30K and HW30K, shows that our method achieves substantial gains in Fréchet Inception Distance (FID) and CLIP score, even with fewer than 4 sampling steps. Our findings suggest that the end-to-end mechanism might pave the way for more robust and efficient solutions, i.e., combining diffusion stability with GAN-like discriminative optimization in an end-to-end manner.
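To make the idea of "a direct transformation from isotropic Gaussian noise to the target data distribution" concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a scalar "image", a scalar condition `c` standing in for a text prompt, a two-parameter linear update standing in for the denoising network, and finite-difference gradients in place of backpropagation. All names (`sample`, `end_to_end_loss`, `train`) are hypothetical. The point it illustrates is only that the loss is measured on the final sample after the full multi-step trajectory, with the same parameters reused at every step.

```python
import random

T = 3  # number of sampling steps (the abstract reports fewer than 4)

def sample(theta, z, c, steps=T):
    """Run the whole sampler from noise z, reusing the same parameters at every step."""
    a, b = theta
    x = z
    for _ in range(steps):
        x = a * x + b * c  # toy stand-in for one conditional denoising step
    return x

def end_to_end_loss(theta, batch):
    """Mean squared error between the FINAL sample and its target (no per-step loss)."""
    return sum((sample(theta, z, c) - y) ** 2 for z, c, y in batch) / len(batch)

def grad(theta, batch, eps=1e-5):
    """Central finite differences; a real model would backpropagate through the loop."""
    g = []
    for i in range(len(theta)):
        hi = list(theta)
        lo = list(theta)
        hi[i] += eps
        lo[i] -= eps
        g.append((end_to_end_loss(hi, batch) - end_to_end_loss(lo, batch)) / (2 * eps))
    return g

def train(batch, theta=(0.5, 0.0), lr=0.02, iters=300):
    """Plain gradient descent on the end-to-end loss."""
    theta = list(theta)
    for _ in range(iters):
        g = grad(theta, batch)
        theta = [p - lr * gi for p, gi in zip(theta, g)]
    return theta

rng = random.Random(0)
batch = []
for _ in range(64):
    c = rng.uniform(-1.0, 1.0)     # hypothetical scalar "prompt"
    z = rng.gauss(0.0, 1.0)        # pure Gaussian noise: the sampler's starting point
    batch.append((z, c, 2.0 * c))  # toy target: the "image" for prompt c is 2c

initial = end_to_end_loss([0.5, 0.0], batch)
theta = train(batch)
final = end_to_end_loss(theta, batch)
print(f"loss before: {initial:.4f}, after: {final:.4f}")
```

In a realistic setting the squared-error term would presumably be combined with perceptual and adversarial terms on the same final sample, and gradients would flow through every sampling step via automatic differentiation. Keeping the number of steps small is what makes that unrolling affordable.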

Paper Structure (33 sections, 8 equations, 7 figures, 3 tables, 1 algorithm)

Figures (7)

  • Figure 1: Comparison of two training methods for diffusion models. (a) illustrates the single-step training method, where the model is trained to predict noise in a single denoising step for randomly sampled time steps. This approach introduces a training-sampling gap, as the training focuses on single-step denoising, whereas testing requires iterative multi-step denoising. Furthermore, the forward noising process can result in information leakage, where the final state $x_T$ deviates from ideal Gaussian noise, compromising the reconstruction quality. (b) illustrates our proposed end-to-end training method, E2ED$^2$, which directly optimizes the entire sampling trajectory. Beginning with pure Gaussian noise, the model generates images through multi-step sampling, aligning the training and testing processes seamlessly. This approach effectively eliminates information leakage and enables the integration of advanced loss functions, such as perceptual and GAN losses, thereby improving the fidelity, semantic consistency, and overall quality of the generated images.
  • Figure 2: Qualitative comparison of synthesized images across different methods on various subjects (human portraits, objects) and styles. Our method demonstrates superior performance in terms of image quality, aesthetic appeal, and text-image alignment. The generated images show finer details, richer textures, and better adherence to the input prompts compared to SOTA methods. This highlights the effectiveness of our end-to-end training framework and loss function design in balancing perceptual quality and semantic consistency.
  • Figure 3: Qualitative comparison of generated results across different loss configurations. Each column represents a specific loss setting: L1, L2, LPIPS, L2+LPIPS, and L2+LPIPS+GAN. The inclusion of LPIPS loss improves perceptual quality, while combining L2+LPIPS with GAN loss adds high-frequency details, such as fine hair strands in portraits and intricate petal textures in flowers. Although this introduces a slight trade-off in text-image alignment, the combination of L2, LPIPS, and GAN losses achieves the best balance, producing realistic and semantically aligned outputs.
  • Figure 4: Comparison of training loss curves between parameter-sharing and parameter-independent configurations. The results demonstrate that parameter sharing consistently achieves lower loss values, highlighting its effectiveness in stabilizing training and improving overall model performance.
  • Figure 5: User Interface Demonstration: Our custom-designed user interface presents evaluators with pairs of images, one pair at a time, alongside the corresponding generation prompt. Evaluators answer three questions, each targeting a distinct dimension of user preference.
  • ...and 2 more figures
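
For reference, the contrast drawn in the Figure 1 caption and the loss combinations listed in the Figure 3 caption can be written out as follows. Line (a) is the standard noise-prediction objective commonly used for single-step diffusion training. Lines (b) are only a sketch of an end-to-end objective consistent with the captions. The symbols $f_\theta$, $c$, $\hat{x}_0$, $\lambda_{\text{perc}}$, and $\lambda_{\text{adv}}$ are assumptions introduced here, not notation taken from the paper.

```latex
% (a) Single-step training: denoise one randomly sampled timestep t.
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}_{\text{single}}
  = \mathbb{E}_{x_0,\, \epsilon,\, t}
    \left[ \lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2 \right]

% (b) End-to-end training: unroll T steps from pure noise,
%     then score only the final sample \hat{x}_0 against the data x_0.
x_T \sim \mathcal{N}(0, I),
\qquad x_{t-1} = f_\theta(x_t, t, c), \quad t = T, \dots, 1,
\qquad \hat{x}_0 := x_0^{\text{(generated)}}

\mathcal{L}_{\text{e2e}}
  = \lVert \hat{x}_0 - x_0 \rVert_2^2
  + \lambda_{\text{perc}}\, \mathcal{L}_{\text{LPIPS}}(\hat{x}_0, x_0)
  + \lambda_{\text{adv}}\, \mathcal{L}_{\text{GAN}}(\hat{x}_0)
```

In (a), whenever $\bar\alpha_T$ is small but nonzero, as it is under many common noise schedules, the terminal state $x_T$ still carries a trace of $x_0$ and is not exactly standard Gaussian. This is one way to read the "information leakage" that the Figure 1 caption describes. In (b), sampling starts from $\mathcal{N}(0, I)$ by construction, so training and inference see the same starting distribution.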