U-REPA: Aligning Diffusion U-Nets to ViTs

Yuchuan Tian, Hanting Chen, Mengyu Zheng, Yuchen Liang, Chao Xu, Yunhe Wang

TL;DR

U-REPA extends REPA to diffusion U-Nets in three ways: it identifies mid-network layers as the best alignment targets, closes the spatial-dimension gap by upsampling features after an MLP, and supplements rigid tokenwise alignment with a manifold-based similarity regularization. The method converges quickly and generates high-quality samples, reaching $FID<1.5$ in $200$ epochs on ImageNet 256×256 and an $FID$ of $1.41$ with substantially fewer iterations than comparable REPA baselines. The approach scales to higher resolutions and reduces energy cost, which makes it practical for efficient diffusion model training. Overall, U-REPA broadens the applicability of representation alignment to U-Net architectures and provides design guidelines for cross-architecture feature alignment between U-Nets and ViT encoders.

Abstract

Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training and shows superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture, which converges faster than DiTs. Adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) space gaps between U-Net and ViT hinder the effectiveness of tokenwise alignment. To counter these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows. First, we observe that, due to skip connections, the middle stage of the U-Net is the best alignment option. Second, we propose upsampling U-Net features after passing them through MLPs. Third, we observe difficulty when performing tokenwise similarity alignment, and further introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence. With a CFG guidance interval, U-REPA can reach $FID<1.5$ in 200 epochs or 1M iterations on ImageNet 256 $\times$ 256, and needs only half the total epochs to perform better than REPA under sd-vae-ft-ema. Code: https://github.com/YuchuanTian/U-REPA
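The second and third components of the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. All function names (`mlp`, `upsample_nearest`, `tokenwise_loss`, `manifold_loss`) are hypothetical, nearest-neighbour upsampling is an assumption, and the manifold term is assumed here to be the mean squared difference between the pairwise cosine-similarity matrices of the two feature sets; the paper's exact formulation may differ. The input grid stands in for features taken from the middle U-Net stage.

```python
import math

def mlp(token, w1, b1, w2, b2):
    """Two-layer MLP with ReLU applied to one token (a list of floats)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, token)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of an H x W grid of token vectors (assumed mode)."""
    out = []
    for row in grid:
        wide = [tok for tok in row for _ in range(factor)]
        for _ in range(factor):
            out.append(list(wide))
    return out

def cosine(a, b):
    """Cosine similarity with a small epsilon for numerical safety."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)

def tokenwise_loss(unet_tokens, vit_tokens):
    """REPA-style term: negative mean cosine similarity of matching tokens."""
    sims = [cosine(u, v) for u, v in zip(unet_tokens, vit_tokens)]
    return -sum(sims) / len(sims)

def similarity_matrix(tokens):
    """Pairwise cosine similarities within one feature set."""
    return [[cosine(a, b) for b in tokens] for a in tokens]

def manifold_loss(unet_tokens, vit_tokens):
    """Assumed manifold term: MSE between the two pairwise-similarity matrices."""
    s_u = similarity_matrix(unet_tokens)
    s_v = similarity_matrix(vit_tokens)
    n = len(s_u)
    return sum((s_u[i][j] - s_v[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Toy 2 x 2 mid-stage feature grid with 2-dimensional tokens.
grid = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [2.0, 1.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]   # placeholder MLP weights
zero = [0.0, 0.0]

# Project first, then upsample to the (assumed) 4 x 4 ViT token grid.
projected = [[mlp(tok, identity, zero, identity, zero) for tok in row] for row in grid]
upsampled = upsample_nearest(projected, 2)
unet_tokens = [tok for row in upsampled for tok in row]

# ViT features that live in a rotated space: same geometry, different axes.
vit_rotated = [[-t[1], t[0]] for t in unet_tokens]

print(len(unet_tokens))
print(tokenwise_loss(unet_tokens, vit_rotated))
print(manifold_loss(unet_tokens, vit_rotated))
```

In this toy case the rotated target keeps every pairwise relation intact, so the assumed manifold term vanishes even though the tokenwise cosine similarity is zero. This is one way to read the abstract's remark that space gaps hinder tokenwise alignment while relative similarities can still be regularized.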

Paper Structure

This paper contains 17 sections, 6 equations, 5 figures, 17 tables.

Figures (5)

  • Figure 1: The proposed U-REPA framework. We found that semantic-rich intermediate layers are the best for representation alignment, and that dimension and space gaps hinder alignment efficacy. To counter these challenges, we scale up features and propose manifold alignment.
  • Figure 2: Investigating alignment with respect to encoder depths on diffusion models with skip connections. Left: SiT with skip connections. Because the newly established skip dependencies change block functionalities, the optimal encoder depth shifts towards the middle of the model. Right: SiT$\downarrow$, the U-Net-based SiT model. The shaded region represents the higher U-Net stage. The plot suggests that stage transitions (downsampling and upsampling in U-Net) bring large block functionality gaps. Alignment within the higher U-Net stage is thus necessary for alignment performance.
  • Figure 3: The convergence of average tokenwise similarities. While SiT-L/2 could achieve better tokenwise similarities, SiT$\downarrow$ converges at a lower similarity value, indicating difficulties of feature alignment.
  • Figure 4: Samples generated by SiT$\downarrow$+U-REPA at 1M iterations. The samples are generated following the setting of REPA, at $cfg=4$. Best viewed on screen.
  • Figure 5: Comparing the visual quality of SiT+REPA (upper row) and SiT$\downarrow$+U-REPA (lower row). The samples are generated with the sampling strategy that yields the state-of-the-art FID for each method. Best viewed on screen.