DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

Junyu Chen, Dongyun Zou, Wenkun He, Junsong Chen, Enze Xie, Song Han, Han Cai

TL;DR

DC-AE 1.5 tackles the slow convergence of diffusion models trained on high-channel latent spaces with two techniques: a structured latent space, in which front latent channels carry object structure and back channels carry image details, and Augmented Diffusion Training, which adds auxiliary diffusion objectives on the object channels. Together they yield faster convergence and better scaling, making higher spatial compression ratios (e.g., f64c128) practical while keeping quality competitive and raising throughput on ImageNet. On ImageNet 256×256 and 512×512, DC-AE 1.5 outperforms DC-AE in both convergence speed and final image quality, with notable gains on large backbones such as USiT-2B/3B, in results reported without classifier-free guidance. By making the latent space more diffusion-friendly, the work offers a practical way to raise the quality upper bound of latent diffusion models and may leave room for higher compression and larger models in high-resolution image synthesis.
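
As a rough illustration of why a higher spatial compression ratio reduces the work a diffusion model has to do, the sketch below counts latent tokens for a 512×512 image under the f32c32 and f64c128 configurations. It assumes one token per latent position (patch size 1), which is an assumption of this example and not a setting stated here; the helper names `latent_shape` and `num_tokens` are hypothetical.

```python
def latent_shape(image_size: int, f: int, c: int) -> tuple:
    """Return (channels, height, width) of the latent for a square image.

    f is the spatial compression ratio and c the latent channel number,
    following the fXcY naming used for the autoencoders.
    """
    if image_size % f != 0:
        raise ValueError("image_size must be divisible by f")
    side = image_size // f
    return (c, side, side)


def num_tokens(image_size: int, f: int, patch_size: int = 1) -> int:
    """Number of tokens seen by the diffusion model (patch_size is assumed)."""
    side = image_size // f
    if side % patch_size != 0:
        raise ValueError("latent side must be divisible by patch_size")
    return (side // patch_size) ** 2


for name, f, c in [("f32c32", 32, 32), ("f64c128", 64, 128)]:
    channels, height, width = latent_shape(512, f, c)
    print(name, (channels, height, width), "tokens:", num_tokens(512, f))
```

Under these assumptions, f64c128 produces a quarter as many tokens as f32c32 while storing the same number of latent values, which is why the larger channel count is needed to hold reconstruction quality. Measured throughput depends on the model and hardware, so the token ratio should not be read as the reported speedup.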

Abstract

We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder's latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space, with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, a strategy that adds diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512×512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4× faster. Code: https://github.com/dc-ai-projects/DC-Gen.
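
The channel-wise structure described above can be pictured with a small sketch of random channel masking applied to a latent before decoding. This is a minimal illustration and not the authors' implementation: it assumes the mask keeps a random-length prefix of channels, and the sampling rule (`min_keep`, `step`), the function names, and the list-based latent layout are all hypothetical choices made for this example. The paper's own mask equation is not reproduced here.

```python
import random


def sample_channel_mask(num_channels, min_keep, step, rng):
    """Return a 0/1 list that keeps a random-length prefix of channels.

    Assumption: the number of kept channels is drawn uniformly from
    {min_keep, min_keep + step, ..., num_channels}.
    """
    if not 1 <= min_keep <= num_channels or step < 1:
        raise ValueError("need 1 <= min_keep <= num_channels and step >= 1")
    candidates = list(range(min_keep, num_channels + 1, step))
    if candidates[-1] != num_channels:
        candidates.append(num_channels)  # always allow the full latent
    keep = rng.choice(candidates)
    return [1.0] * keep + [0.0] * (num_channels - keep)


def apply_channel_mask(latent, mask):
    """Zero out masked channels; latent is a list of channels of floats."""
    return [[v * m for v in channel] for channel, m in zip(latent, mask)]


rng = random.Random(0)
# Toy latent: 8 channels, 4 spatial positions each.
latent = [[float(c + 1)] * 4 for c in range(8)]
mask = sample_channel_mask(num_channels=8, min_keep=2, step=2, rng=rng)
masked = apply_channel_mask(latent, mask)
print("kept channels:", int(sum(mask)))
```

Because the front channels survive every sampled mask while the back channels are often dropped, a decoder trained on such inputs has to reconstruct the overall image from the front channels alone, which is one way to read how front channels come to hold object structure.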

Paper Structure

This paper contains 21 sections, 2 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: (a) Training Throughput Comparison under Different Autoencoder Spatial Compression Ratios. Increasing the autoencoder's spatial compression ratio effectively improves diffusion models' training efficiency by producing a latent space with fewer tokens. However, larger latent channel numbers are required to maintain satisfactory reconstruction quality. (b) rFID and gFID Results under Different Latent Channel Numbers. We use DiT-XL [peebles2023scalable] as the diffusion model. rFID keeps improving with more latent channels, while gFID keeps getting worse. (c) Efficiency-Quality Trade-off Comparison on ImageNet 512×512. Classifier-free guidance [ho2022classifier] is not used. DC-AE-1.5-f64c128 delivers a 4× speedup over DC-AE-f32c32 while maintaining better image generation quality.
  • Figure 2: We visualize the channel-wise average feature here. We provide the complete latent space visualization in the supplementary material (Figures \ref{fig:latent_space_visualization_1} and \ref{fig:latent_space_visualization_2}). The visualization shows that the object structure information becomes blurred as the latent channel number increases, which prevents diffusion models from learning object structure efficiently. As a result, object structures become gradually more distorted as we enlarge the latent channel number, as shown in the visualization of the diffusion model's outputs. We use DiT-XL as the diffusion model here.
  • Figure 3: Image Reconstruction Comparison. With the structured latent space, DC-AE 1.5 can reconstruct images given partial latent channels, with front latent channels reconstructing overall object structure and semantics and latter latent channels adding details. In contrast, DC-AE cannot reconstruct the object structure well given partial latent channels. The decoder is fine-tuned for all settings to fully reveal the information encoded in the (partial) latent space.
  • Figure 4: Illustration of DC-AE 1.5 Autoencoder Training Strategy. The key difference from conventional autoencoder training is that we add a channel-wise random masking step before feeding the latent features to the decoder. The mask is generated randomly at each step according to Eq. \ref{eq:mask}. This enables the autoencoder to reconstruct with partial latent channels and naturally imposes the channel-wise structure on the latent space.
  • Figure 5: (a) Illustration of Augmented Diffusion Training. We randomly generate a channel-wise mask at each training step and use it to augment diffusion training. (b) Training Curve Comparison. We achieve 6× faster convergence on UViT-H [bao2023all] with Augmented Diffusion Training.
  • ...and 5 more figures
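
The Figure 5(a) caption describes using a per-step channel-wise mask to augment diffusion training. One possible reading is sketched below as a loss that is evaluated only on the channels the mask keeps. This is an assumption-laden illustration, not the authors' objective: the function name `masked_mse`, the normalisation over kept elements, and the toy tensors are hypothetical, and how the mask is also applied to the model input is not shown.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over the channels kept by a 0/1 channel mask.

    pred and target are lists of channels, each a flat list of floats.
    Assumption: the loss is averaged over kept elements only.
    """
    total, count = 0.0, 0
    for pred_channel, target_channel, m in zip(pred, target, mask):
        if m == 0:
            continue  # masked channels contribute nothing to this objective
        for p, t in zip(pred_channel, target_channel):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("mask must keep at least one channel")
    return total / count


# Toy example: 4 channels, 2 positions; the prediction is exact on the
# two front channels and wrong on the two back channels.
target = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
pred = [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0], [0.0, 0.0]]

full_loss = masked_mse(pred, target, [1, 1, 1, 1])
front_loss = masked_mse(pred, target, [1, 1, 0, 0])
print("full:", full_loss, "front-only:", front_loss)
```

In this toy case the front-only objective isolates the error on the front channels from the error on the detail channels, which gives a sense of how an extra objective restricted to object channels could supply a more direct training signal for object structure.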