ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion

Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, Dong-Jin Kim

TL;DR

This work tackles high-resolution image synthesis with pretrained diffusion models, without retraining. It introduces Neighborhood Patch Attention (NPA) to cut redundant self-attention computation, and combines Latent Frequency Mixing (LFM) with Structure Guidance (SG) in an SDEdit-based latent-space upsample-then-denoise pipeline that preserves global structure while enriching detail. The approach is model-agnostic and achieves state-of-the-art performance among training-free methods on both U-Net and Diffusion Transformer architectures, with substantial speedups (e.g., up to 3–8× over certain baselines) and higher-quality outputs at $4096^2$. This makes high-fidelity, high-resolution synthesis practical across model families at a reduced computational cost.

Abstract

Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
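To give a rough sense of why restricting self-attention to local neighborhoods saves computation, the following sketch counts query-key pairs for global attention versus a patch-local scheme. It is only an illustration under stated assumptions: the function names, the `patch` and `neighborhood` parameters, and the rule that each non-overlapping query patch attends to one fixed-size key window are hypothetical and are not taken from the paper's definition of NPA.

```python
def full_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs scored by global self-attention on a height x width token grid."""
    tokens = height * width
    return tokens * tokens


def neighborhood_attention_pairs(
    height: int, width: int, patch: int, neighborhood: int
) -> int:
    """Query-key pairs when queries are grouped into non-overlapping patch x patch
    blocks and each block attends to one neighborhood x neighborhood key window.

    Assumed scheme for illustration only; the paper's NPA may differ in detail.
    """
    if height % patch or width % patch:
        raise ValueError("grid size must be divisible by the patch size")
    if not patch <= neighborhood <= min(height, width):
        raise ValueError("neighborhood must cover the patch and fit in the grid")
    num_patches = (height // patch) * (width // patch)
    queries_per_patch = patch * patch
    keys_per_patch = neighborhood * neighborhood
    return num_patches * queries_per_patch * keys_per_patch


if __name__ == "__main__":
    # Hypothetical token grid and window sizes, chosen only to make the ratio easy to read.
    full = full_attention_pairs(256, 256)
    local = neighborhood_attention_pairs(256, 256, patch=32, neighborhood=64)
    print(f"global: {full}, patch-local: {local}, ratio: {full / local:.1f}x")
```

Under these assumptions the global cost grows with the square of the token count, while the patch-local cost grows linearly in the token count for a fixed window size, which is why the gap widens as resolution increases.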

Paper Structure

This paper contains 20 sections, 4 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison between U-Net (SDXL) and DiT (FLUX). Zoom in for a better view. Elapsed time to generate the image is shown in the top-left corner. Images are generated at $4096^2$.
  • Figure 2: Process of NPA.
  • Figure 3: Comparison between different reference latents.
  • Figure 4: Overview of our pipeline. ScaleDiff starts from a generated low-resolution latent, upsamples it with LFM, and diffuses it to an intermediate timestep $\tau$. At each denoising step, the network (integrated with NPA) applies structure guidance to preserve the global image structure.
  • Figure 5: Qualitative comparison with other methods. All images are generated at $4096^2$ from the same low-resolution input. Zoom in for a better view.
  • ...and 8 more figures
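
The first two stages described in the Figure 4 caption (upsample a low-resolution latent, then diffuse it to an intermediate timestep $\tau$) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `upsample_nearest` is a plain nearest-neighbor stand-in rather than LFM, `diffuse_to` uses the standard DDPM forward process $x_\tau = \sqrt{\bar\alpha_\tau}\,x_0 + \sqrt{1-\bar\alpha_\tau}\,\epsilon$, and the value of `alpha_bar_tau` is a hypothetical choice. Models trained with a different noise schedule or objective would use their own forward process.

```python
import math
import random


def upsample_nearest(latent, scale):
    """Nearest-neighbor upsampling of a 2D grid (stand-in for LFM, which is not shown)."""
    return [
        [value for value in row for _ in range(scale)]
        for row in latent
        for _ in range(scale)
    ]


def diffuse_to(latent, alpha_bar_tau, rng):
    """DDPM forward process: x_tau = sqrt(a) * x_0 + sqrt(1 - a) * eps, eps ~ N(0, 1)."""
    if not 0.0 <= alpha_bar_tau <= 1.0:
        raise ValueError("alpha_bar_tau must lie in [0, 1]")
    signal_scale = math.sqrt(alpha_bar_tau)
    noise_scale = math.sqrt(1.0 - alpha_bar_tau)
    return [
        [signal_scale * value + noise_scale * rng.gauss(0.0, 1.0) for value in row]
        for row in latent
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    low_res = [[0.5, -0.5], [1.0, 0.0]]              # toy single-channel latent
    high_res = upsample_nearest(low_res, scale=2)    # 2x2 grid becomes 4x4
    noised = diffuse_to(high_res, alpha_bar_tau=0.5, rng=rng)  # hypothetical noise level
    print(len(noised), len(noised[0]))
```

Denoising then proceeds from the noised latent at $\tau$ rather than from pure noise, so the coarse layout of the low-resolution image carries over while the remaining steps fill in detail.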