HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

Shen Zhang; Zhaowei Chen; Zhenyu Zhao; Yuhao Chen; Yao Tang; Jiajun Liang

TL;DR

HiDiffusion tackles two core problems in higher-resolution generation with pretrained diffusion models: object duplication and slow inference. It is a tuning-free framework that combines a Resolution-Aware U-Net (RAU-Net) with Modified Shifted Window Multi-head Self-Attention (MSW-MSA) to scale pretrained models up to $4096\times4096$. RAU-Net aligns deep-block feature maps using a Resolution-Aware Downsampler (RAD) and a Resolution-Aware Upsampler (RAU), while MSW-MSA replaces costly global self-attention with large-window local attention whose window shift varies across timesteps. Across the SD 1.5, SD 2.1, and SDXL families, HiDiffusion produces richer detail at 1.5-6x the inference speed of prior methods, without any additional training.
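
As a rough illustration of why local window attention is cheaper than global self-attention, consider the following back-of-the-envelope comparison. It is a sketch and not a result from the paper: $N$ (number of tokens in a feature map), $d$ (channel dimension), and $k$ (number of equal, non-overlapping windows) are symbols introduced here, and $k=4$ is an assumed example value.

$$
C_{\text{global}} \propto N^2 d, \qquad
C_{\text{window}} \propto k\left(\frac{N}{k}\right)^2 d = \frac{N^2 d}{k}, \qquad
\frac{C_{\text{window}}}{C_{\text{global}}} = \frac{1}{k}.
$$

Under the assumption $k=4$, the attention-score computation costs about a quarter of the global version. This counts only the attention products themselves; projections and the rest of the U-Net are unchanged, so the end-to-end speedup is smaller than $k$.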

Abstract

Diffusion models have become a mainstream approach for high-resolution image synthesis. However, directly generating higher-resolution images from pretrained diffusion models leads to unreasonable object duplication and exponentially increases the generation time. In this paper, we discover that object duplication arises from feature duplication in the deep blocks of the U-Net. Concurrently, we attribute the extended generation time to self-attention redundancy in the U-Net's top blocks. To address these issues, we propose a tuning-free higher-resolution framework named HiDiffusion. Specifically, HiDiffusion contains a Resolution-Aware U-Net (RAU-Net) that dynamically adjusts the feature map size to resolve object duplication, and employs Modified Shifted Window Multi-head Self-Attention (MSW-MSA), which uses optimized window attention to reduce computation. We can integrate HiDiffusion into various pretrained diffusion models to scale image generation resolutions even to $4096\times4096$ at 1.5-6x the inference speed of previous methods. Extensive experiments demonstrate that our approach can address the object duplication and heavy computation issues, achieving state-of-the-art performance on higher-resolution image synthesis tasks.
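
The window attention idea mentioned in the abstract can be illustrated with a small, dependency-free sketch of shifted window partitioning. This is not the authors' implementation: the function names, the half-size windows, and the cyclic shift schedule in `shift_for_step` are all assumptions made for illustration. Tokens are stood in for by integer ids so the grouping is easy to inspect; in a real model, self-attention would be computed independently inside each returned window.

```python
from typing import List, Tuple

Grid = List[List[int]]


def roll_grid(grid: Grid, dy: int, dx: int) -> Grid:
    """Cyclically shift a 2D grid down by dy rows and right by dx columns."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def partition_windows(
    grid: Grid, win_h: int, win_w: int, shift: Tuple[int, int] = (0, 0)
) -> List[List[int]]:
    """Split a token grid into non-overlapping windows after a cyclic shift.

    Each returned list holds the token ids that would attend to one another.
    """
    h, w = len(grid), len(grid[0])
    if h % win_h or w % win_w:
        raise ValueError("grid size must be divisible by the window size")
    # Shift the grid up/left so window borders land on different tokens.
    shifted = roll_grid(grid, -shift[0], -shift[1])
    windows = []
    for top in range(0, h, win_h):
        for left in range(0, w, win_w):
            windows.append(
                [shifted[y][x]
                 for y in range(top, top + win_h)
                 for x in range(left, left + win_w)]
            )
    return windows


def shift_for_step(step: int, win_h: int, win_w: int) -> Tuple[int, int]:
    """Hypothetical schedule: cycle through four offsets as the timestep changes."""
    offsets = [
        (0, 0),
        (win_h // 4, win_w // 4),
        (win_h // 2, win_w // 2),
        (3 * win_h // 4, 3 * win_w // 4),
    ]
    return offsets[step % len(offsets)]


# Example: a 4x4 token grid split into four 2x2 windows (assumed half-size windows).
tokens = [[row * 4 + col for col in range(4)] for row in range(4)]
unshifted = partition_windows(tokens, 2, 2)
shifted = partition_windows(tokens, 2, 2, shift=(1, 1))
```

Varying the shift between denoising steps means that tokens separated by a window border at one step share a window at another, which is the usual motivation for shifted-window schemes. The specific schedule above is only a placeholder.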

Paper Structure (27 sections, 9 equations, 24 figures, 9 tables)

Introduction
Related Work
Method
Preliminaries
U-Net architecture.
Content generation over timestep.
HiDiffusion
Resolution-Aware U-Net.
Modified Shifted Window Attention
Experiments
Experiment Settings
Main results
Ablation study
Conclusion
Feature Duplication across Inference Step
...and 12 more sections

Figures (24)

Figure 1: 2048$\times$2048 resolution images based on SDXL [podell2023sdxl]. The first line of text indicates the generation method, while the second line indicates the generation time and the inference speed relative to direct inference. Our HiDiffusion can generate reasonable and realistic high-resolution images with high efficiency. Compared to previous methods, ours exhibits richer fine-grained details and is 1.58$\times$ faster than ScaleCrafter [he2023scalecrafter] and 4.18$\times$ faster than DemoFusion [du2023demofusion]. Best viewed when zoomed in.
Figure 1: Selected HiDiffusion samples for various diffusion models, resolutions, and aspect ratios. HiDiffusion enables pretrained diffusion models to generate higher-resolution images that surpass the training image size without further training or fine-tuning, and can effectively accelerate inference. Best viewed when zoomed in.
Figure 2: Comparison between the vanilla Stable Diffusion U-Net architecture and our proposed HiDiffusion RAU-Net architecture at 1024$\times$1024 resolution with SD 1.5 [rombach2022high]. Parameters in all blocks are frozen. The main differences lie in the blue blocks (which differ in feature map dimensions) and the orange blocks (where our proposed RAD and RAU modules are incorporated into Block 1).
Figure 2: The feature map visualization across different inference steps based on SD 1.5. The image resolution is 1024$\times$1024 and we adopt 50 DDIM steps.
Figure 3: The framework of HiDiffusion.
...and 19 more figures
