DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers

Mengping Yang, Zhiyu Tan, Binglei Li, Xiaomeng Yang, Hesen Chen, Hao Li

TL;DR

DiverseDiT is a framework that explicitly promotes representation diversity in DiTs. It combines long residual connections, which diversify input representations across blocks, with a representation diversity loss that encourages blocks to learn distinct features.

Abstract

Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms governing representation learning within DiTs are not well understood. To this end, we first systematically investigate the representation dynamics of DiTs. Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features. Extensive experiments on ImageNet 256x256 and 512x512 demonstrate that our DiverseDiT yields consistent performance gains and convergence acceleration when applied to different backbones with various sizes, even when tested on the challenging one-step generation setting. Furthermore, we show that DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance.
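To make the idea of a representation diversity loss concrete, here is a minimal sketch of one way to penalize similarity between blocks. The abstract does not give the paper's actual loss, so everything below is an assumption made for illustration: the function names `cosine` and `diversity_penalty` are hypothetical, each block is assumed to be summarized by a single pooled feature vector, and squared cosine similarity is a stand-in for whatever similarity measure the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def diversity_penalty(block_feats):
    """Mean squared cosine similarity over all pairs of blocks.

    block_feats: one pooled feature vector per transformer block
    (an assumed summary; the paper may compare token-level features).
    The value is 0 when all blocks are mutually orthogonal and 1 when
    all blocks point in the same (or exactly opposite) direction, so
    minimizing it pushes blocks toward distinct features.
    """
    n = len(block_feats)
    if n < 2:
        raise ValueError("need at least two blocks to compare")
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    total = sum(cosine(block_feats[i], block_feats[j]) ** 2 for i, j in pairs)
    return total / len(pairs)


# Hypothetical usage: three blocks with 2-dimensional pooled features.
orthogonal_blocks = [[1.0, 0.0], [0.0, 1.0]]
collapsed_blocks = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
print(diversity_penalty(orthogonal_blocks))
print(diversity_penalty(collapsed_blocks))
```

In a training loop, a term of this kind would be added to the diffusion objective with a small weight; the weighting and the choice of which blocks to compare are design decisions that this excerpt does not specify.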

Paper Structure (38 sections, 16 equations, 22 figures, 19 tables)

Figures (22)

  • Figure 1: CKA representation similarities of models trained under various settings. We observe that 1) the discrepancies between different blocks increase as training progresses; 2) aligning specific blocks significantly increases the dissimilarity between those blocks and the others; 3) aligning more blocks with different pretrained encoders brings only marginal performance improvements. Detailed quantitative results are provided in \ref{sec:supp:analysis_details}.
  • Figure 2: Generated samples from different training iterations. Images are sampled using the same seed, noise and class label. We use a classifier-free guidance scale of 4.0 during sampling.
  • Figure 3: Generated samples on ImageNet 256$\times$256 from our DiverseDiT. We use a classifier-free guidance scale of 4.0.
  • Figure A1: Detailed diagram of our proposed DiverseDiT. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features.
  • Figure A2: CKA representation similarities across different timesteps. The representational discrepancies across different timesteps show similar correlations.
  • ...and 17 more figures
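
Figures 1 and A2 compare block representations with CKA (centered kernel alignment). The sketch below implements the standard linear CKA formula, $\mathrm{CKA}(X, Y) = \|X^\top Y\|_F^2 / (\|X^\top X\|_F \, \|Y^\top Y\|_F)$ on column-centered feature matrices. Whether the paper uses the linear or a kernel variant is not stated in this excerpt, so the linear form is an assumption, and the helper names are hypothetical.

```python
import math


def center_columns(X):
    """Subtract each feature's mean so every column has zero mean."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]


def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-aligned matrices X and Y."""
    total = 0.0
    for a in range(len(X[0])):
        for b in range(len(Y[0])):
            s = sum(X[i][a] * Y[i][b] for i in range(len(X)))
            total += s * s
    return total


def linear_cka(X, Y):
    """Linear CKA between two sets of features for the same n samples.

    X: n rows of d1 features (e.g. activations of one block).
    Y: n rows of d2 features (e.g. activations of another block).
    Returns a value in [0, 1]; higher means more similar representations.
    """
    if len(X) != len(Y):
        raise ValueError("X and Y must describe the same samples")
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = math.sqrt(
        cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc)
    )
    if denominator == 0.0:
        raise ValueError("CKA is undefined when a representation is constant")
    return numerator / denominator


# Hypothetical usage: features of two blocks for the same three samples.
block_a = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
block_b = [[2.0 * v for v in row] for row in block_a]  # rescaled copy
print(linear_cka(block_a, block_b))
```

Linear CKA is invariant to isotropic rescaling and to orthogonal transformations of the features, which is why it is a common choice for comparing layers of different widths. A block-by-block similarity map like the ones in the figures would be obtained by evaluating `linear_cka` for every pair of blocks.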