HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation

Yifan Zhang, Bryan Hooi

TL;DR

HiPA delivers a practical solution to the slow inference of diffusion-based text-to-image generation by introducing high-frequency-focused, low-rank adaptor training on frozen models, enabling high-quality one-step generation. The method couples a spatial perceptual loss with a high-frequency loss derived from Fourier components and edge information to specifically promote high-frequency details. Experiments show HiPA substantially improves one-step generation over prior methods while drastically reducing training time and parameter counts, and it extends effectively to editing, inpainting, and super-resolution tasks. The approach highlights the critical role of high-frequency information in rapid diffusion and offers a resource-efficient path toward real-time diffusion-based image synthesis.

Abstract

Diffusion models have revolutionized text-to-image generation, but their real-world applications are hampered by the extensive time needed for hundreds of diffusion steps. Although progressive distillation has been proposed to speed up diffusion sampling to 2-8 steps, it still falls short in one-step generation, and necessitates training multiple student models, which is highly parameter-extensive and time-consuming. To overcome these limitations, we introduce High-frequency-Promoting Adaptation (HiPA), a parameter-efficient approach to enable one-step text-to-image diffusion. Grounded in the insight that high-frequency information is essential but highly lacking in one-step diffusion, HiPA focuses on training one-step, low-rank adaptors to specifically enhance the under-represented high-frequency abilities of advanced diffusion models. The learned adaptors empower these diffusion models to generate high-quality images in just a single step. Compared with progressive distillation, HiPA achieves much better performance in one-step text-to-image generation (37.3 $\rightarrow$ 23.8 in FID-5k on MS-COCO 2017) and 28.6x training speed-up (108.8 $\rightarrow$ 3.8 A100 GPU days), requiring only 0.04% training parameters (7,740 million $\rightarrow$ 3.3 million). We also demonstrate HiPA's effectiveness in text-guided image editing, inpainting and super-resolution tasks, where our adapted models consistently deliver high-quality outputs in just one diffusion step. The source code will be released.
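The headline ratios in the abstract can be re-derived from the raw figures it quotes. The short check below uses only numbers stated above; the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the abstract (progressive distillation vs. HiPA).
pd_gpu_days, hipa_gpu_days = 108.8, 3.8      # A100 GPU days of training
pd_params_m, hipa_params_m = 7740.0, 3.3     # trainable parameters, in millions
pd_fid, hipa_fid = 37.3, 23.8                # FID-5k on MS-COCO 2017 (lower is better)

speedup = pd_gpu_days / hipa_gpu_days                      # training speed-up factor
param_fraction_pct = 100.0 * hipa_params_m / pd_params_m   # HiPA parameters as % of baseline
fid_improvement = pd_fid - hipa_fid                        # absolute FID reduction

print(f"training speed-up:  {speedup:.1f}x")
print(f"trainable params:   {param_fraction_pct:.2f}% of the baseline")
print(f"FID-5k improvement: {fid_improvement:.1f} points")
```

The quotients round to the 28.6x and 0.04% reported in the abstract.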

Paper Structure (35 sections, 13 equations, 21 figures, 13 tables)

Figures (21)

  • Figure 1: Performance of one-step text-to-image diffusion on MS-COCO 2017 (Lin et al., 2014). We observe that our HiPA performs remarkably well in terms of FID while requiring much less computation time and fewer training parameters.
  • Figure 2: Illustration of text-to-image generation with different diffusion steps based on Stable Diffusion (Rombach et al., 2022) and the DPM sampler (Lu et al., 2022). Initially, simple low-frequency components form, followed by complex high-frequency details that increase realism. Notably, one-step diffusion images lack the complex high-frequency components, making them noticeably less realistic.
  • Figure 3: Power Spectral Density analysis of the generated images by Stable Diffusion with different diffusion steps (DPM sampler).
  • Figure 4: Illustration of the impact of high-frequency components in enhancing image clarity for one-step text-to-image diffusion. Combining the high-frequency components from the 15-step images with the low-frequency components from fewer-step images results in sharper images after the inverse Fourier transform, while using one-step high-frequency components provides no clarity enhancement.
  • Figure 5: An illustration of our parameter-efficient High-frequency-Promoting Adaptation (HiPA) approach.
  • ...and 16 more figures
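
The frequency-swapping experiment described in the Figure 4 caption can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the circular low-pass mask, the `cutoff` radius, and the function names `radial_lowpass_mask` and `mix_frequencies` are all hypothetical, and the inputs are single-channel arrays rather than real generated images.

```python
import numpy as np

def radial_lowpass_mask(height, width, cutoff):
    """Boolean mask, True within `cutoff` bins of the centre of an fftshift-ed spectrum.

    The circular shape and the `cutoff` radius are assumptions for illustration.
    """
    ys, xs = np.ogrid[:height, :width]
    dist = np.hypot(ys - height // 2, xs - width // 2)
    return dist <= cutoff

def mix_frequencies(low_src, high_src, cutoff):
    """Combine the low frequencies of `low_src` with the high frequencies of `high_src`."""
    if low_src.shape != high_src.shape or low_src.ndim != 2:
        raise ValueError("expected two 2-D arrays of identical shape")
    mask = radial_lowpass_mask(*low_src.shape, cutoff)
    # Move the zero-frequency term to the centre so the mask is a simple disc.
    low_spec = np.fft.fftshift(np.fft.fft2(low_src))
    high_spec = np.fft.fftshift(np.fft.fft2(high_src))
    mixed = np.where(mask, low_spec, high_spec)
    # The mask is symmetric about the centre, so the mixed spectrum stays
    # Hermitian and the inverse transform is real up to rounding error.
    return np.fft.ifft2(np.fft.ifftshift(mixed)).real

# Stand-ins for a few-step image and a 15-step image (random data, not model output).
rng = np.random.default_rng(0)
few_step = rng.random((64, 64))
many_step = rng.random((64, 64))
sharpened = mix_frequencies(few_step, many_step, cutoff=8)
print(sharpened.shape)
```

With real images, `low_src` would be the one-step or few-step output and `high_src` the 15-step output, applied per colour channel; where the paper places the low/high boundary is not stated in this section, so `cutoff` should be treated as a free parameter.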