SARA: Structural and Adversarial Representation Alignment for Training-efficient Diffusion Models

Hesen Chen, Junyan Wang, Zhiyu Tan, Hao Li

TL;DR

SARA addresses the trade-off between training efficiency and image quality in diffusion models by extending REPA with a hierarchy of representation alignment: patch-wise alignment, structural autocorrelation alignment, and adversarial distribution alignment. Its architecture combines a pretrained visual encoder, a diffusion network, an MLP projector, and a lightweight discriminator, trained with a joint loss that enforces both local fidelity and global distribution coherence. On ImageNet-256, SARA converges twice as fast as REPA and reaches an FID of 1.36, with state-of-the-art or near-state-of-the-art FID scores both with and without classifier-free guidance (CFG). The results indicate that multi-level, structure-aware alignment improves both synthesis quality and training efficiency, and suggest that the approach may carry over to diffusion transformers and related generative systems.

Abstract

Modern diffusion models encounter a fundamental trade-off between training efficiency and generation quality. While existing representation alignment methods, such as REPA, accelerate convergence through patch-wise alignment, they often fail to capture structural relationships within visual representations or to ensure global distribution consistency between pretrained encoders and denoising networks. To address these limitations, we introduce SARA, a hierarchical alignment framework that enforces multi-level representation constraints: (1) patch-wise alignment to preserve local semantic details, (2) autocorrelation matrix alignment to maintain structural consistency within representations, and (3) adversarial distribution alignment to mitigate global representation discrepancies. Unlike previous approaches, SARA explicitly models both intra-representation correlations via self-similarity matrices and inter-distribution coherence via adversarial alignment, enabling comprehensive alignment across local and global scales. Experiments on ImageNet-256 show that SARA achieves an FID of 1.36 while converging twice as fast as REPA, surpassing recent state-of-the-art image generation methods. This work establishes a systematic paradigm for optimizing diffusion training through hierarchical representation alignment.
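
The three alignment levels named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the exact loss forms (negative cosine similarity for patches, mean squared difference between self-similarity matrices, a non-saturating adversarial term), the loss weights, and all function names are assumptions chosen for illustration. Patch features are assumed to be non-zero vectors, and the discriminator is represented only by its output probabilities.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def patch_alignment_loss(proj, target):
    """Level 1 (assumed form): mean negative cosine similarity between
    each projected diffusion patch and the matching encoder patch."""
    return -sum(cosine(p, t) for p, t in zip(proj, target)) / len(proj)

def self_similarity(feats):
    """N x N matrix of pairwise cosine similarities between patches."""
    return [[cosine(a, b) for b in feats] for a in feats]

def structural_alignment_loss(proj, target):
    """Level 2 (assumed form): mean squared difference between the
    self-similarity matrices of the two representations."""
    a, b = self_similarity(proj), self_similarity(target)
    n = len(a)
    return sum((a[i][j] - b[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

def adversarial_alignment_loss(disc_probs_on_proj):
    """Level 3 (assumed form): non-saturating loss -log D(proj), where
    D(proj) in (0, 1) is a hypothetical discriminator's probability
    that a projected feature came from the pretrained encoder."""
    return -sum(math.log(p) for p in disc_probs_on_proj) / len(disc_probs_on_proj)

def sara_alignment_loss(proj, target, disc_probs,
                        w_patch=1.0, w_struct=1.0, w_adv=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_patch * patch_alignment_loss(proj, target)
            + w_struct * structural_alignment_loss(proj, target)
            + w_adv * adversarial_alignment_loss(disc_probs))

# Toy example: three 2-D "patch" features from the pretrained encoder.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Every patch rotated by 90 degrees: pairwise structure is unchanged.
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
# All patches identical: pairwise structure is lost.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

for name, proj in [("rotated", rotated), ("collapsed", collapsed)]:
    print(name,
          "patch:", round(patch_alignment_loss(proj, target), 4),
          "structural:", round(structural_alignment_loss(proj, target), 4))
```

In this toy setup the rotated features score poorly on the patch-wise term but incur essentially no structural penalty, while the collapsed features are penalized by the structural term. This illustrates why the two terms are complementary rather than redundant: the first compares patches one to one, the second compares the relationships among patches.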

Paper Structure

This paper contains 18 sections, 11 equations, 21 figures, 5 tables.

Figures (21)

  • Figure 1: SARA enhances the efficiency and effectiveness of diffusion model training through multi-level representation alignment. Compared to REPA, it converges 2$\times$ faster.
  • Figure 2: Analysis of representation correlations and redundancy across DINOv2, REPA, and SARA.
  • Figure 3: Overview of the SARA framework. SARA aligns diffusion model representations with powerful pretrained visual features through a combination of complementary alignment strategies.
  • Figure 4: Generated Image Comparison of REPA vs SARA. We compare images generated by two SiT-XL/2 models over the first 400K iterations, with one model using REPA and the other using SARA. Both models are initialized with the same noise, employ the same sampler and number of sampling steps, and do not use classifier-free guidance.
  • Figure 5: Selected samples from the SiT-XL/2+SARA model on ImageNet 256×256, generated using classifier-free guidance with $w=4.0$.
  • ...and 16 more figures