Ultra-Resolution Adaptation with Ease
Ruonan Yu, Songhua Liu, Zhenxiong Tan, Xinchao Wang
TL;DR
This work tackles the challenge of adapting diffusion-based text-to-image models to ultra-high resolutions under limited data and compute. It introduces URAE, a set of guidelines that combines data-efficient training on synthetic data with parameter-efficient fine-tuning of the minor components of the weight matrices, plus a recommendation on classifier-free guidance (CFG) for guidance-distilled models. The authors provide theoretical insight (Theorem 1) and extensive experiments showing that synthetic data from a teacher model accelerates convergence, that minor-component tuning outperforms LoRA when synthetic data are unavailable, and that disabling classifier-free guidance during adaptation is crucial. With only 3K synthetic samples and 2K iterations, URAE achieves 2K-generation performance comparable to a leading closed-source model, FLUX1.1 [Pro] Ultra, and reports strong 4K results. The approach lowers the data and compute barriers for ultra-resolution diffusion and remains compatible with training-free high-resolution pipelines, though the authors acknowledge limits in inference efficiency and outline avenues for integration with larger multi-modal systems.
Abstract
Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives, data efficiency and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation termed URAE. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves 2K-generation performance comparable to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Code is available at https://github.com/Huage001/URAE.
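To make the idea of tuning minor components of a weight matrix more concrete, here is a minimal NumPy sketch. It rests on an assumption: "minor components" is read here as the singular directions with the smallest singular values, which are split off as small trainable factors while the rest of the matrix stays frozen. The function name `split_minor_components`, the `rank` parameter, and the toy matrix are hypothetical and chosen for illustration only; this is not the authors' released implementation.

```python
import numpy as np


def split_minor_components(weight, rank):
    """Split `weight` into a frozen major part and trainable minor factors.

    Assumption (not confirmed by the abstract): "minor components" means the
    `rank` singular directions with the smallest singular values.
    """
    # Thin SVD; NumPy returns singular values in descending order.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    if not 0 < rank <= s.shape[0]:
        raise ValueError("rank must be in [1, min(weight.shape)]")

    k = s.shape[0] - rank  # number of leading (major) components kept frozen
    major = (u[:, :k] * s[:k]) @ vt[:k, :]

    # Share each singular value between the two factors so that a @ b
    # reproduces the minor part of the original weight exactly.
    sqrt_s = np.sqrt(s[k:])
    a = u[:, k:] * sqrt_s            # shape (out_features, rank)
    b = sqrt_s[:, None] * vt[k:, :]  # shape (rank, in_features)
    return major, a, b


# Toy example: an 8x6 "layer weight" with the 2 smallest components trainable.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))
major, a, b = split_minor_components(w, rank=2)

# Before any training step, the decomposition reproduces the original weight.
reconstructed = major + a @ b
```

In a fine-tuning loop, only `a` and `b` would receive gradient updates while `major` stays fixed. Unlike a standard low-rank adapter, which typically starts from a zero update added to the full weight, this parameterization starts from directions that already exist in the pretrained matrix.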
