E2ED^2: Direct Mapping from Noise to Data for Enhanced Diffusion Models
Zhiyu Tan, WenXu Qian, Hesen Chen, Mengping Yang, Lei Chen, Hao Li
TL;DR
Diffusion models face three core issues: a training-inference gap, information leakage in the forward noising process, and difficulty integrating perceptual and adversarial losses. The authors propose E2ED$^2$, an end-to-end framework that directly maps Gaussian noise to the data distribution, aligning training with sampling and enabling joint optimization with perceptual and GAN losses. On COCO30K and HW30K, E2ED$^2$ achieves improved FID and CLIP scores with fewer than four sampling steps ($T<4$) and a compact latent-diffusion backbone, outperforming several state-of-the-art baselines. This approach potentially unifies diffusion stability with GAN-like discriminative optimization, offering a more robust and efficient path for high-quality text-to-image generation.
Abstract
Diffusion models have become the de facto paradigm in visual generative modeling, with remarkable success across applications ranging from high-quality image synthesis to temporally aware video generation. Despite these advances, three fundamental limitations persist: 1) a discrepancy between the training and inference processes, 2) progressive information leakage throughout the noise corruption procedure, and 3) inherent constraints that prevent effective integration of modern optimization criteria such as perceptual and adversarial losses. To mitigate these challenges, we present a novel end-to-end learning paradigm that establishes direct optimization from the final generated samples back to the initial noise. Our proposed End-to-End Differentiable Diffusion, dubbed E2ED^2, introduces several key improvements. By conceptualizing training as a direct transformation from isotropic Gaussian noise to the target data distribution, it eliminates the sequential training-sampling mismatch and the intermediate information leakage. This training framework also enables seamless incorporation of adversarial and perceptual losses into the core optimization objective. Comprehensive evaluation on standard benchmarks, including COCO30K and HW30K, shows that our method achieves substantial gains in Fréchet Inception Distance (FID) and CLIP score, even with fewer than 4 sampling steps. Our findings suggest that the end-to-end mechanism might pave the way for more robust and efficient solutions, i.e., combining diffusion stability with GAN-like discriminative optimization in an end-to-end manner.
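
One way to make the end-to-end idea concrete is the sketch below. The notation is assumed here for illustration and is not taken from the paper: $G_\theta$ is the denoising network, $\Phi$ a generic differentiable sampler update, $c$ the text condition, $\mathcal{D}$ the latent decoder, $D_\psi$ a discriminator, $\phi$ a fixed perceptual feature extractor, $x$ a reference image for $c$, and $\lambda_{\mathrm{perc}}, \lambda_{\mathrm{adv}}$ hypothetical loss weights. Sampling is unrolled from pure noise for $T<4$ steps, and the loss is placed only on the final output:

$$
z_T \sim \mathcal{N}(0, I), \qquad
z_{t-1} = \Phi\big(z_t,\; G_\theta(z_t, t, c),\; t\big), \quad t = T, \dots, 1, \qquad
\hat{x} = \mathcal{D}(z_0)
$$

$$
\mathcal{L}(\theta) = \mathbb{E}_{z_T,\,(x, c)}\Big[\, \lambda_{\mathrm{perc}}\, \big\| \phi(\hat{x}) - \phi(x) \big\|_2^2 \;+\; \lambda_{\mathrm{adv}}\, \mathcal{L}_{\mathrm{adv}}\big(D_\psi(\hat{x}, c)\big) \Big]
$$

Under this reading, every step of the chain is differentiable, so the gradient of $\mathcal{L}$ flows back through all $T$ applications of $G_\theta$ to the initial noise $z_T$. No intermediate sample obtained by noising real data enters the objective, which is one plausible way such a setup could avoid both the training-sampling mismatch and the leakage described above. The exact losses, pairing of noise with data, and sampler used by the authors may differ.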
