DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai
TL;DR
DC-AE 1.5 tackles the slow convergence of latent diffusion models trained on high-channel latents. It introduces a structured latent space, in which the front latent channels carry object structure and the later channels carry image details, and an Augmented Diffusion Training strategy that adds auxiliary learning objectives on the object channels. Together these yield faster diffusion-model convergence and better scaling, which makes higher spatial compression ratios (e.g., f64c128) practical while keeping quality competitive and raising throughput on ImageNet. On ImageNet 256×256 and 512×512, DC-AE 1.5 outperforms the prior DC-AE setup in both learning speed and final image quality, with notable gains on large backbones such as USiT-2B/3B and without classifier-free guidance. By making the latent space more diffusion-friendly, the work offers a practical way to raise the quality upper bound of latent diffusion models, potentially leaving room for higher compression and larger models in high-resolution image synthesis.
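To make the f64c128 shorthand concrete, the sketch below reads a name of the form fXcY as a spatial downsampling factor X and a latent channel count Y, then computes the latent shape for a square image. It compares f64c128 with the f32c32 DC-AE configuration that the abstract uses as its baseline. The helper names (`parse_spec`, `latent_shape`, `summarize`) are hypothetical and exist only for this illustration.

```python
import re


def parse_spec(spec):
    """Split a name like 'f64c128' into (spatial_factor, latent_channels)."""
    match = re.fullmatch(r"f(\d+)c(\d+)", spec)
    if match is None:
        raise ValueError(f"unrecognized autoencoder spec: {spec!r}")
    return int(match.group(1)), int(match.group(2))


def latent_shape(spec, image_size):
    """Return (channels, height, width) of the latent for a square image."""
    factor, channels = parse_spec(spec)
    if image_size % factor != 0:
        raise ValueError("image size must be divisible by the spatial factor")
    side = image_size // factor
    return channels, side, side


def summarize(spec, image_size):
    """Count spatial positions and total values in the latent."""
    channels, height, width = latent_shape(spec, image_size)
    positions = height * width
    return {
        "shape": (channels, height, width),
        "positions": positions,
        "values": channels * positions,
    }


for spec in ("f32c32", "f64c128"):
    print(spec, summarize(spec, 512))
```

At 512×512 both configurations store the same number of latent values, but f64c128 spreads them over 64 spatial positions instead of 256. If the diffusion backbone processes one token per latent position (an assumption made here, not a statement from the source), that is four times fewer tokens per image.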
Abstract
We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the number of latent channels in the autoencoder is a highly effective way to improve its reconstruction quality. However, it slows the convergence of diffusion models, leading to poorer generation quality despite the better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the use of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach that imposes a desired channel-wise structure on the latent space, with the front latent channels capturing object structures and the later latent channels capturing image details; ii) Augmented Diffusion Training, a training strategy that adds diffusion training objectives on the object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512×512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4× faster. Code: https://github.com/dc-ai-projects/DC-Gen.
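The abstract does not spell out the form of the additional objectives, so the following is only a minimal sketch of one way an auxiliary objective on the front (object) channels could be combined with the usual full-latent loss. Everything here is an assumption for illustration: the latent is a plain list of per-channel value lists, the loss is a mean squared error, and the names `augmented_loss`, `sample_keep`, `aux_weight`, and the channel choices are hypothetical rather than taken from the DC-AE 1.5 code.

```python
import random


def mse(pred, target):
    """Mean squared error over a latent stored as a list of per-channel value lists."""
    sq = [(p - t) ** 2 for pc, tc in zip(pred, target) for p, t in zip(pc, tc)]
    return sum(sq) / len(sq)


def augmented_loss(pred, target, keep, aux_weight=1.0):
    """Full-latent MSE plus an auxiliary MSE restricted to the first `keep` channels."""
    if not 1 <= keep <= len(pred):
        raise ValueError("keep must be between 1 and the number of channels")
    full = mse(pred, target)
    front = mse(pred[:keep], target[:keep])  # assumed "object" channels
    return full + aux_weight * front


def sample_keep(rng, choices):
    """Pick how many front channels the auxiliary term covers for this step."""
    return rng.choice(choices)


# Toy latent: 8 channels with 4 values each. The prediction matches the
# target on the first 4 channels and is off by 1.0 on the last 4.
target = [[0.0] * 4 for _ in range(4)] + [[1.0] * 4 for _ in range(4)]
pred = [[0.0] * 4 for _ in range(8)]

rng = random.Random(0)
keep = sample_keep(rng, (2, 4, 8))  # hypothetical channel choices
print(keep, augmented_loss(pred, target, keep))
```

In this toy case the auxiliary term is zero whenever `keep` covers only channels the prediction already gets right, and grows once it reaches the mismatched channels, which shows how such a term would put extra weight on the front of the latent.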
