Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers

Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun

TL;DR

This work investigates spatial acceleration for diffusion transformers (DiTs) via latent upsampling and proposes Region-Adaptive Latent Upsampling (RALU), a training-free framework that uses mixed-resolution latent upsampling to accelerate DiTs while mitigating upsampling artifacts.

Abstract

Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a major challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We find that naïve latent upsampling introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatching caused by noise-timestep discrepancies. Based on these findings, we propose a training-free spatial acceleration framework, Region-Adaptive Latent Upsampling (RALU), which mitigates these artifacts while accelerating DiTs through mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration by upsampling early only in artifact-prone edge regions and by matching noise and timesteps across latent resolutions, yielding up to 7.0$\times$ speedup on FLUX-1.dev and 3.0$\times$ on Stable Diffusion 3 with negligible quality degradation. Furthermore, RALU is complementary to existing temporal acceleration methods and timestep-distilled models; combined with them, it reaches up to 15.9$\times$ speedup.
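As a rough illustration of why running early steps at lower latent resolution saves compute, the sketch below compares a two-stage schedule against running every step at full resolution. It is a back-of-the-envelope model, not the paper's cost analysis: the function name `relative_cost`, the `exponent` parameter, and the assumption that per-step cost scales as a power of the token count are all hypothetical. The 9 + 9 step split with 2$\times$ upsampling mirrors the configuration quoted in the Figure 2 caption; the model ignores RALU's mixed-resolution stage and any change in total step count, so it does not reproduce the reported speedups.

```python
def relative_cost(n_low: int, n_full: int, scale: int = 2, exponent: float = 1.0) -> float:
    """Cost of a low-res-then-full-res schedule relative to all-full-resolution sampling.

    Assumption (not from the paper): one denoising step costs token_count**exponent.
    exponent=1.0 models cost linear in tokens; exponent=2.0 models a purely
    attention-dominated, quadratic cost.
    """
    # Downsampling height and width by `scale` leaves 1/scale**2 of the tokens.
    token_ratio = 1.0 / scale**2
    # Cost of one low-resolution step, with one full-resolution step normalized to 1.
    low_step_cost = token_ratio**exponent
    return (n_low * low_step_cost + n_full) / (n_low + n_full)


# 9 low-resolution steps followed by 9 full-resolution steps, 2x upsampling.
linear = relative_cost(9, 9, scale=2, exponent=1.0)
quadratic = relative_cost(9, 9, scale=2, exponent=2.0)
print(f"relative cost: linear={linear:.3f}, quadratic={quadratic:.3f}")
```

Under either assumption the full-resolution steps dominate the total, which is consistent with the abstract's motivation for keeping as much of the latent as possible at low resolution for as long as artifacts allow.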

Paper Structure

This paper contains 68 sections, 21 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: 1024$\times$1024 images generated with different acceleration methods on FLUX-1.dev [black2024flux] at a 7$\times$ speedup. While temporal acceleration methods struggle with aggressive speedups and Bottleneck Sampling [tian2025training] introduces artifacts, our RALU successfully accelerates while avoiding artifacts and maintaining high image quality.
  • Figure 2: (a) An example of aliasing artifacts generated using FLUX-1.dev (prompt: "A man on the tennis court is about to use his racket") with 9 low-resolution steps, 2$\times$ upsampling, and 9 full-resolution steps. (b) Edge energy and aliasing artifact ratio over image vs. upsampling timestep, averaged over 100 images.
  • Figure 3: (a) An example of mismatching artifacts generated using FLUX-1.dev (prompt: "A group of people standing on top of a snow covered ski slope") with early upsampling ($t_{up} = 0.3$) and noise injection. (b) ImageReward [xu2023imagereward] score and mismatching artifact ratio vs. JSD, averaged over 100 images.
  • Figure 4: Overview of the proposed RALU framework. RALU consists of three different resolution processes: (1) low-resolution sampling for early denoising, (2) mixed-resolution sampling by upsampling edge region latents, and (3) full-resolution refinement by upsampling all remaining latents. (a) We select the top $r$ fraction of patches with the strongest edge signals from the decoded image and upsample them early. (b) We add correlated noise to the upsampled latents and design a corresponding timestep schedule. (See the subsection on noise-timestep (NT) rescheduling for more details.)
  • Figure 5: Resolving artifacts from naïve latent upsampling. Aliasing artifacts are avoided by (B) early upsampling, while mismatching artifacts are mitigated by (C) noise and timestep matching.
  • ...and 16 more figures
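
The patch selection described in the Figure 4 caption (choose the top $r$ fraction of patches with the strongest edge signals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the caption does not say which edge detector is used, so a finite-difference gradient magnitude stands in for it, and the function name `select_edge_patches`, the `patch` size, and the default `r` are hypothetical.

```python
import numpy as np


def select_edge_patches(image: np.ndarray, patch: int = 16, r: float = 0.25) -> np.ndarray:
    """Return a boolean (H // patch, W // patch) mask of the top-r fraction of
    patches ranked by summed edge energy.

    Assumption: edge strength is approximated by gradient magnitude; the
    paper's actual edge detector may differ.
    """
    # Collapse color channels so the edge map is computed on a 2D array.
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    gy, gx = np.gradient(gray.astype(np.float64))  # derivatives along rows, columns
    edge = np.hypot(gx, gy)

    # Crop to a whole number of patches, then sum edge energy inside each patch.
    ph, pw = edge.shape[0] // patch, edge.shape[1] // patch
    edge = edge[: ph * patch, : pw * patch]
    energy = edge.reshape(ph, patch, pw, patch).sum(axis=(1, 3))

    # Keep the k patches with the largest energy (at least one patch).
    k = min(ph * pw, max(1, int(round(r * ph * pw))))
    top = np.argpartition(energy.ravel(), -k)[-k:]
    mask = np.zeros(ph * pw, dtype=bool)
    mask[top] = True
    return mask.reshape(ph, pw)


# Usage: a 64x64 image whose only edges lie inside patch (1, 1).
demo = np.zeros((64, 64))
demo[20:28, 20:28] = 1.0
demo_mask = select_edge_patches(demo, patch=16, r=1 / 16)
```

In a pipeline following the caption, the latents corresponding to `True` entries would be upsampled early, while the remaining latents stay at low resolution until the full-resolution refinement stage.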

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3