REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training

Ziqiao Wang, Wangbo Zhao, Yuhao Zhou, Zekai Li, Zhiyuan Liang, Mingjia Shi, Xuanlei Zhao, Pengfei Zhou, Kaipeng Zhang, Zhangyang Wang, Kai Wang, Yang You

TL;DR

Diffusion transformers achieve high image quality but train slowly. Prior external representation guidance (REPA) accelerates early learning yet can hinder later performance because of a capacity mismatch between the generative student and its non-generative teacher. The authors introduce HASTE, a two-phase approach that first applies Holistic Alignment (combining feature and attention guidance) and then Stage-wise Termination (disabling the alignment loss once a trigger fires) to free the model for denoising. On ImageNet 256×256, HASTE reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, a 28× reduction in optimization steps. It also improves text-to-image generation on MS-COCO and requires no architecture changes. The work clarifies when external guidance helps and when it hinders, and offers a simple, principled recipe for efficient diffusion training.

Abstract

Diffusion Transformers (DiTs) deliver state-of-the-art image quality, yet their training remains notoriously slow. A recent remedy, representation alignment (REPA), which matches DiT hidden features to those of a non-generative teacher (e.g., DINO), dramatically accelerates the early epochs but plateaus or even degrades performance later. We trace this failure to a capacity mismatch: once the generative student begins modelling the joint data distribution, the teacher's lower-dimensional embeddings and attention patterns become a straitjacket rather than a guide. We then introduce HASTE (Holistic Alignment with Stage-wise Termination for Efficient training), a two-phase schedule that keeps the help and drops the hindrance. Phase I applies a holistic alignment loss that simultaneously distills attention maps (relational priors) and feature projections (semantic anchors) from the teacher into mid-level layers of the DiT, yielding rapid convergence. Phase II then performs one-shot termination that deactivates the alignment loss once a simple trigger, such as a fixed iteration, is hit, freeing the DiT to focus on denoising and exploit its generative capacity. HASTE speeds up training of diverse DiTs without architecture changes. On ImageNet 256×256, it reaches the vanilla SiT-XL/2 baseline FID in 50 epochs and matches REPA's best FID in 500 epochs, amounting to a 28× reduction in optimization steps. HASTE also improves text-to-image DiTs on MS-COCO, showing that it is a simple yet principled recipe for efficient diffusion training across various tasks. Our code is available at https://github.com/NUS-HPC-AI-Lab/HASTE .
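
The two-phase schedule described in the abstract can be sketched as a loss that switches at a fixed iteration. This is only an illustrative sketch, not the authors' implementation: the class name `AlignmentSchedule`, the weights `lambda_feat` and `lambda_attn`, and the additive form of the Phase I loss are all assumptions made here. The abstract states only that a feature term and an attention term are combined with the denoising objective and then dropped once a trigger such as a fixed iteration fires.

```python
from dataclasses import dataclass


@dataclass
class AlignmentSchedule:
    """Hypothetical helper; names, defaults, and the additive form are assumptions."""

    tau: int                  # trigger iteration at which alignment is switched off
    lambda_feat: float = 1.0  # assumed weight on the feature-alignment term
    lambda_attn: float = 1.0  # assumed weight on the attention-alignment term

    def alignment_active(self, step: int) -> bool:
        # Phase I covers steps strictly before the trigger.
        return step < self.tau

    def total_loss(self, step, denoise, feat_align, attn_align):
        # Phase I: denoising loss plus both alignment terms.
        if self.alignment_active(step):
            return (
                denoise
                + self.lambda_feat * feat_align
                + self.lambda_attn * attn_align
            )
        # Phase II: one-shot termination, pure denoising from here on.
        return denoise


if __name__ == "__main__":
    schedule = AlignmentSchedule(tau=100, lambda_feat=0.5, lambda_attn=0.5)
    print(schedule.total_loss(0, 1.0, 2.0, 4.0))    # Phase I
    print(schedule.total_loss(100, 1.0, 2.0, 4.0))  # Phase II
```

Because the method uses only `+` and `*`, the loss arguments can be plain floats, as here, or tensors from a deep-learning framework. The switch is one-shot rather than a gradual decay, which mirrors the "one-shot termination" wording in the abstract.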

Paper Structure

This paper contains 50 sections, 5 equations, 27 figures, 11 tables.

Figures (27)

  • Figure 1: Training SiT-XL/2 on ImageNet $256{\times}256$. Adding REPA slashes FID early on, but its benefit fades and ultimately reverses; dropping the alignment loss mid-training restores progress.
  • Figure 2: Overview of our framework. Phase I (left) distills both feature embeddings and attention maps from a frozen, non-generative teacher (DINOv2) into mid-level layers of the student DiT. When a simple trigger $\tau$ fires, the alignment loss is disabled; Phase II (right) then continues training with pure denoising.
  • Figure 3: Cosine similarity between REPA and denoising gradients. Acute $\to$ orthogonal $\to$ obtuse: the auxiliary signal turns from booster to brake.
  • Figure 4: Gradient similarity as a function of diffusion timestep $t$. At $t\!=\!0.1$ (high-detail phase) the two losses already conflict even early in training.
  • Figure 5: Replacing teacher inputs with low-pass images leaves REPA’s early gain intact: evidence that the auxiliary loss transmits mainly global structure. We train SiT-L/2 for 200K iterations.
  • ...and 22 more figures
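
The diagnostic behind Figures 3 and 4, the cosine similarity between the alignment-loss gradient and the denoising-loss gradient, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' measurement code: it assumes each gradient has already been flattened into a list of floats, and the function names `gradient_cosine` and `regime` and the tolerance `tol` are hypothetical. The sign of the result corresponds to the acute, orthogonal, and obtuse regimes named in the Figure 3 caption.

```python
import math
from typing import Sequence


def gradient_cosine(g_align: Sequence[float], g_denoise: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    if len(g_align) != len(g_denoise):
        raise ValueError("gradients must have the same number of entries")
    dot = sum(a * b for a, b in zip(g_align, g_denoise))
    norm = math.sqrt(sum(a * a for a in g_align)) * math.sqrt(
        sum(b * b for b in g_denoise)
    )
    # A zero gradient has no direction; report 0.0 instead of dividing by zero.
    if norm == 0.0:
        return 0.0
    return dot / norm


def regime(cos: float, tol: float = 1e-2) -> str:
    """Label the angle between the gradients; `tol` is an assumed threshold."""
    if cos > tol:
        return "acute"       # the auxiliary loss pushes the same way as denoising
    if cos < -tol:
        return "obtuse"      # the auxiliary loss opposes denoising
    return "orthogonal"      # the auxiliary loss neither helps nor hurts


if __name__ == "__main__":
    for g_a, g_d in [([1.0, 0.0], [2.0, 0.0]),
                     ([1.0, 0.0], [0.0, 3.0]),
                     ([1.0, 0.0], [-1.0, 0.0])]:
        print(regime(gradient_cosine(g_a, g_d)))
```

In a real training run the two vectors would come from separate backward passes of the alignment and denoising losses over the same shared parameters.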