Efficient Training-Free High-Resolution Synthesis with Energy Rectification in Diffusion Models

Zhen Yang, Guibao Shen, Minyang Li, Liang Hou, Mushui Liu, Luozhou Wang, Xin Tao, Pengfei Wan, Di Zhang, Ying-Cong Chen

TL;DR

RectifiedHR provides a training-free approach to high-resolution diffusion-based image synthesis by introducing a noise refresh scheme that progressively increases resolution during sampling and an energy rectification mechanism that counteracts energy decay and blur. The method is lightweight, compatible with multiple diffusion-model techniques, and demonstrates superior efficiency and fidelity at resolutions up to $4096\times4096$ compared with existing training-free baselines. Through extensive quantitative and qualitative experiments on SDXL and cross-model applications, the authors show robust improvements in FID, KID, IS, and CLIP scores while maintaining practical runtimes. The work also explores broader applications, including video generation, image editing, customization, and controllable generation, highlighting RectifiedHR’s versatility and potential impact for scalable high-resolution diffusion synthesis.

Abstract

Diffusion models have achieved remarkable progress across various visual generation tasks. However, their performance significantly declines when generating content at resolutions higher than those used during training. Although numerous methods have been proposed to enable high-resolution generation, they all suffer from inefficiency. In this paper, we propose RectifiedHR, a straightforward and efficient solution for training-free high-resolution synthesis. Specifically, we propose a noise refresh strategy that unlocks the model's training-free high-resolution synthesis capability and improves efficiency. Additionally, we are the first to observe the phenomenon of energy decay, which may cause image blurriness during the high-resolution synthesis process. To address this issue, we introduce average latent energy analysis and find that tuning the classifier-free guidance hyperparameter can significantly improve generation performance. Our method is entirely training-free and demonstrates efficient performance. Furthermore, we show that RectifiedHR is compatible with various diffusion model techniques, enabling advanced features such as image editing, customized generation, and video synthesis. Extensive comparisons with numerous baseline methods validate the superior effectiveness and efficiency of RectifiedHR.
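To make the two quantities in the abstract concrete, here is a minimal sketch of how the guidance scale $\omega$ enters the standard classifier-free guidance combination and how a mean-square energy could be measured. It rests on assumptions: "average latent energy" is taken to be the mean of the squared latent elements (the abstract does not state the definition), latents are flattened to plain lists, and the names `average_latent_energy` and `cfg_combine` are illustrative, not the authors' code. The toy numbers show only the mechanics and do not reproduce any result from the paper.

```python
from typing import List

Latent = List[float]  # a flattened latent, for illustration only


def average_latent_energy(latent: Latent) -> float:
    """Mean of squared elements (ASSUMED definition of average latent energy)."""
    return sum(v * v for v in latent) / len(latent)


def cfg_combine(eps_uncond: Latent, eps_cond: Latent, omega: float) -> Latent:
    """Standard classifier-free guidance: eps_u + omega * (eps_c - eps_u)."""
    return [u + omega * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (made-up values, not model outputs).
eps_uncond = [0.1, -0.2, 0.3, 0.0]
eps_cond = [0.3, -0.5, 0.6, 0.2]

# Energy of the guided prediction for a few guidance scales.
energies = [
    average_latent_energy(cfg_combine(eps_uncond, eps_cond, omega))
    for omega in (1.0, 5.0, 10.0)
]
```

In this toy setup the guided prediction's energy grows with $\omega$, which is the knob the abstract says is tuned to counteract energy decay; how much to raise it at each stage is a choice made in the paper, not something this sketch determines.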

Paper Structure

This paper contains 24 sections, 16 equations, 19 figures, 3 tables.

Figures (19)

  • Figure 1: Generated images by RectifiedHR. The training-free RectifiedHR enables diffusion models (SDXL is shown in the figure) to synthesize images at resolutions exceeding their original training resolution. Please zoom in for a closer view.
  • Figure 2: Visualizations of the "predicted $x_0$" at different timesteps $t$, abbreviated as $p_{x_0}^t$. The figure shows how $p_{x_0}^t$ changes over the sampling steps, where the x-axis represents the timestep in the sampling process. The 11 images are evenly sampled from the 50 steps. It can be observed that in the first half of the process, $p_{x_0}^t$ is mainly responsible for global structure generation, while the second half is mainly responsible for local detail generation. Moreover, later in the sampling process, the image corresponding to $p_{x_0}^t$ exhibits the characteristics of an RGB image.
  • Figure 3: (a) The x-axis denotes the timesteps of the sampling process, and the y-axis indicates the average latent energy. The blue line shows the average latent energy of the original sampling process when generating $1024\times1024$-resolution images. The red line corresponds to our noise refresh sampling process, where noise refresh is applied at the 30th and 40th timesteps, and the resolution progressively increases from $1024\times1024$ to $2048\times2048$, and subsequently to $3072\times3072$. It can be observed that noise refresh induces a noticeable decay in average latent energy. From the left images, it is evident that after energy rectification, image details become more pronounced. (b) The x-axis represents the timestep, the y-axis represents the average latent energy, and $\omega$ denotes the hyperparameter for classifier-free guidance. It can be observed that the average latent energy increases as $\omega$ increases. From the right figures, one can observe how the generated images vary with increasing $\omega$.
  • Figure 4: Overview of RectifiedHR. (a) The original sampling process and its pseudocode. (b) The sampling process and pseudocode of our method. The orange components in the pseudocode and modules correspond to Noise Refresh, while the purple components represent Energy Rectification. $\textcolor{orange}{\epsilon}$ denotes Gaussian random noise, whose shape adapts to that of $\textcolor{orange}{\tilde{p}_{x_0}^t}$. The definitions of other symbols used in the pseudocode can be found in Sec. 3.1.
  • Figure 5: Qualitative comparison between our method and SDXL+BSRGAN at a resolution of $2048\times2048$.
  • ...and 14 more figures
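The noise refresh step described in the captions of Figures 3 and 4 can be sketched as follows: the predicted $x_0$ is enlarged, then fresh Gaussian noise of the matching shape is added so that sampling can continue at the higher resolution. This is a rough sketch under stated assumptions, not the authors' pseudocode. Nearest-neighbour upsampling of a single-channel latent stands in for whatever resizing the paper uses, re-noising follows the standard forward-diffusion formula $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, and the names `upsample_nearest`, `noise_refresh`, and `alpha_bar_t` are hypothetical.

```python
import math
import random
from typing import List

Grid = List[List[float]]  # a single-channel 2D latent, for illustration only


def upsample_nearest(x: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling: each value becomes a factor x factor patch."""
    return [
        [v for v in row for _ in range(factor)]
        for row in x
        for _ in range(factor)
    ]


def noise_refresh(pred_x0: Grid, alpha_bar_t: float, factor: int,
                  rng: random.Random) -> Grid:
    """Upsample the predicted x0, then re-noise it to timestep t's noise level."""
    x0_up = upsample_nearest(pred_x0, factor)  # ASSUMED stand-in for the resize
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    # Fresh Gaussian noise is drawn per element, so its shape follows x0_up.
    return [[signal * v + sigma * rng.gauss(0.0, 1.0) for v in row]
            for row in x0_up]


rng = random.Random(0)
pred_x0 = [[0.5, -0.5], [1.0, 0.0]]  # made-up 2x2 prediction
x_t = noise_refresh(pred_x0, alpha_bar_t=0.9, factor=2, rng=rng)  # now 4x4
```

Per the Figure 3 caption, the authors apply noise refresh at the 30th and 40th timesteps, which would correspond to two such calls with the resolution growing at each one.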