HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation
Yifan Zhang, Bryan Hooi
TL;DR
HiPA addresses the slow inference of diffusion-based text-to-image generation by training high-frequency-focused, low-rank adaptors on frozen models, enabling high-quality one-step generation. The method couples a spatial perceptual loss with a high-frequency loss, derived from Fourier components and edge information, that specifically promotes high-frequency details. Experiments show that HiPA substantially improves one-step generation over prior methods while sharply reducing training time and the number of trained parameters, and that it extends to editing, inpainting, and super-resolution tasks. The approach highlights the critical role of high-frequency information in few-step diffusion and offers a resource-efficient path toward real-time diffusion-based image synthesis.
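To make the notion of a high-frequency loss concrete, the following is a minimal NumPy sketch, not the authors' formulation. It assumes a single-channel image, a hard radial high-pass mask in the Fourier domain, finite-difference gradients as a stand-in for an edge detector, and a plain mean-squared-error comparison. The function names and the `cutoff` parameter are hypothetical.

```python
import numpy as np

def fourier_high_pass(image, cutoff=0.1):
    """Keep only spatial frequencies whose radius exceeds `cutoff`.

    `cutoff` is in cycles per pixel and is an illustrative choice,
    not a value taken from the paper.
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies in [-0.5, 0.5)
    mask = np.sqrt(fx ** 2 + fy ** 2) > cutoff
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * mask))

def edge_map(image):
    """Gradient magnitude from finite differences (stand-in edge detector)."""
    gy, gx = np.gradient(image)
    return np.sqrt(gx ** 2 + gy ** 2)

def high_frequency_loss(pred, target, cutoff=0.1):
    """Assumed loss: MSE on high-pass components plus MSE on edge maps."""
    hf_term = np.mean(
        (fourier_high_pass(pred, cutoff) - fourier_high_pass(target, cutoff)) ** 2
    )
    edge_term = np.mean((edge_map(pred) - edge_map(target)) ** 2)
    return hf_term + edge_term
```

Under these assumptions, a blurry one-step output that matches the target only in its low frequencies still incurs a large loss, which is the behavior a high-frequency-promoting objective is meant to have.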
Abstract
Diffusion models have revolutionized text-to-image generation, but their real-world applications are hampered by the extensive time needed for hundreds of diffusion steps. Although progressive distillation has been proposed to speed up diffusion sampling to 2-8 steps, it still falls short in one-step generation and requires training multiple student models, which is highly parameter-intensive and time-consuming. To overcome these limitations, we introduce High-frequency-Promoting Adaptation (HiPA), a parameter-efficient approach to enable one-step text-to-image diffusion. Grounded in the insight that high-frequency information is essential but largely missing in one-step diffusion, HiPA trains one-step, low-rank adaptors to specifically enhance the under-represented high-frequency abilities of advanced diffusion models. The learned adaptors enable these diffusion models to generate high-quality images in just a single step. Compared with progressive distillation, HiPA achieves much better performance in one-step text-to-image generation (37.3 $\rightarrow$ 23.8 in FID-5k on MS-COCO 2017) and a 28.6x training speed-up (108.8 $\rightarrow$ 3.8 A100 GPU days), while training only 0.04% of the parameters (7,740 million $\rightarrow$ 3.3 million). We also demonstrate HiPA's effectiveness in text-guided image editing, inpainting, and super-resolution tasks, where our adapted models consistently deliver high-quality outputs in just one diffusion step. The source code will be released.
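As an illustration of why low-rank adaptors need so few trainable parameters, the sketch below wraps a frozen weight matrix with a trainable low-rank update. It assumes a LoRA-style parameterization (frozen W plus a product B A, with B initialized to zero); the abstract states only that the adaptors are low-rank, so the class name, rank, and initialization here are hypothetical.

```python
import numpy as np

class LowRankAdaptedLinear:
    """Frozen weight W plus a trainable low-rank update B @ A (assumed form)."""

    def __init__(self, weight, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                 # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))    # trainable
        self.B = np.zeros((d_out, rank))                     # trainable, zero init

    def __call__(self, x):
        # Zero-initialized B means the adapted layer starts out identical
        # to the frozen layer; training only changes A and B.
        return x @ self.weight.T + (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


# Usage: compare trainable and frozen parameter counts for one layer.
W = np.zeros((320, 320))
layer = LowRankAdaptedLinear(W, rank=4)
fraction = layer.trainable_params() / W.size  # rank * (d_in + d_out) / (d_in * d_out)
```

For a square layer the trainable fraction scales as 2r/d, so it shrinks as layers get wider, which is consistent with the very small trained-parameter share reported above.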
