Ultra-Resolution Adaptation with Ease

Ruonan Yu, Songhua Liu, Zhenxiong Tan, Xinchao Wang

TL;DR

This work tackles the challenge of adapting diffusion-based text-to-image models to ultra-high resolutions under limited data and compute. It introduces URAE, a set of guidelines that combines data-efficient training on synthetic data with parameter-efficient fine-tuning of the minor components of the weight matrices, plus a recommendation on classifier-free guidance (CFG) for guidance-distilled models. The authors provide theoretical insights (Theorem 1) and extensive experiments showing that synthetic data from a teacher model accelerates convergence, that tuning minor components outperforms LoRA when synthetic data are unavailable, and that disabling CFG during adaptation (setting the guidance scale to 1) is crucial for guidance-distilled models such as FLUX. With only 3K synthetic samples and 2K iterations, URAE achieves 2K-generation performance comparable to a leading closed-source model, FLUX1.1 [Pro] Ultra, and it sets new benchmarks for 4K generation. The approach lowers the data and compute barriers for ultra-resolution diffusion and remains compatible with training-free high-resolution pipelines, though the authors acknowledge limits in inference efficiency and outline avenues for integration with larger multi-modal systems.

Abstract

Text-to-image diffusion models have achieved remarkable progress in recent years. However, training models for high-resolution image generation remains challenging, particularly when training data and computational resources are limited. In this paper, we explore this practical problem from two key perspectives: data and parameter efficiency, and propose a set of key guidelines for ultra-resolution adaptation termed URAE. For data efficiency, we theoretically and empirically demonstrate that synthetic data generated by some teacher models can significantly promote training convergence. For parameter efficiency, we find that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable, offering substantial performance gains while maintaining efficiency. Additionally, for models leveraging guidance distillation, such as FLUX, we show that disabling classifier-free guidance, i.e., setting the guidance scale to 1 during adaptation, is crucial for satisfactory performance. Extensive experiments validate that URAE achieves comparable 2K-generation performance to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations, while setting new benchmarks for 4K-resolution generation. Codes are available at https://github.com/Huage001/URAE.
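
To make the parameter-efficiency idea concrete, here is a minimal NumPy sketch of one plausible reading of "tuning minor components of the weight matrices": split a weight matrix by SVD, freeze the part spanned by the largest singular values, and expose only the smallest `r` singular components as trainable low-rank factors. The function name `split_minor_components`, the rank `r`, and the square-root split of the singular values between the two factors are assumptions made for illustration, not details taken from the paper or its code release.

```python
import numpy as np

def split_minor_components(W, r):
    """Split W into a frozen major part and trainable rank-r minor factors.

    Illustrative sketch only: the factorization details are assumptions.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # np.linalg.svd returns singular values in descending order,
    # so the minor components are the last r entries.
    root = np.sqrt(S[-r:])
    A = U[:, -r:] * root            # (out_dim, r) trainable factor
    B = root[:, None] * Vt[-r:]     # (r, in_dim) trainable factor
    W_frozen = W - A @ B            # residual holding the major components
    return W_frozen, A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))     # stand-in for a pretrained weight matrix
W_frozen, A, B = split_minor_components(W, r=2)

# Before any training step, the split reproduces the original weights.
print(np.allclose(W_frozen + A @ B, W))
```

During adaptation only `A` and `B` would receive gradients, so the trainable parameter count matches a rank-`r` low-rank adapter while the update starts from the directions the pretrained matrix uses least.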

Paper Structure

This paper contains 25 sections, 2 theorems, 36 equations, 9 figures, 4 tables.

Key Result

Theorem 2.4

Under the setting defined in the paper's three assumptions (labelled ass:1, ass:2, and ass:3 in the source), the error between $W_T$, the parameters after $T$ training iterations, and the optimal $W^*$ is bounded by an expression that is not reproduced here. In that bound, $\Delta_0=W_0-(pW_{ref}+(1-p)W^*)$, $M$ is defined as $\nabla_{W}f(U;W_0)^\top\nabla_{W}f(U;W_0)$, $\delta=f(u;W_{ref})-f(u;W^*)$, and $\lambda_i$ is the $i$-th eigenvalue of $M$.
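
The role of the mixture $pW_{ref}+(1-p)W^*$ in $\Delta_0$ can be seen in a one-dimensional version of the toy linear regression setting. The sketch below is an illustration under stated assumptions, not the paper's experiment: the values of `w_star` and `w_ref`, the noise level, the sample count, and the learning rate are all hypothetical. A fraction `p` of the labels comes from the reference model without noise, the rest are noisy real labels, and full-batch gradient descent settles near the mixture.

```python
import random

def run_toy(p, steps=200, lr=0.1, n=2000, noise=0.5, seed=0):
    """Gradient descent on 1-D least squares with mixed synthetic/real labels."""
    rng = random.Random(seed)
    w_star, w_ref = 2.0, 1.8          # hypothetical optimal and reference weights
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        if rng.random() < p:
            y = w_ref * x             # synthetic sample: clean label from the reference model
        else:
            y = w_star * x + rng.gauss(0.0, noise)  # real sample: noisy label
        xs.append(x)
        ys.append(y)
    w = 0.0                           # initial parameter W_0
    for _ in range(steps):
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w, p * w_ref + (1 - p) * w_star

w, target = run_toy(p=0.5)
print(w, target)                      # w lands close to the mixture p*w_ref + (1-p)*w_star
```

With `p = 1.0` every label is noise-free and the iterate converges to `w_ref` itself; with smaller `p` the label noise on the real samples perturbs the solution around the mixture.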

Figures (9)

  • Figure 1: High-resolution results by our method.
  • Figure 2: A toy linear regression case. There are real data with noisy labels and synthetic data generated by a reference model $W_{ref}$. The proportion of synthetic data is $p$.
  • Figure 3: For CFG-distilled models, classifier-free guidance should be disabled at training time. $z_t$ and $t$ are omitted from the inputs of $\epsilon_{\theta}$ and $\epsilon_{\theta'}$ here for simplicity.
  • Figure 4: GPT-4o preference evaluation against current SOTA T2I models. We ask GPT-4o to select the better image with regard to overall quality, prompt alignment, and visual aesthetics. Our proposed method is preferred over the others.
  • Figure 5: Visualizations of our proposed method applied to training-free high-resolution generation pipelines. The prompt is "A giraffe stands beneath a tree beside a marina."
  • ...and 4 more figures
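
The recommendation in Figure 3 can be read against the standard classifier-free guidance combination, written here as a sketch in generic notation. The symbols $s$ (guidance scale), $c$ (text condition), and $\varnothing$ (empty condition) are assumptions introduced for illustration and may differ from the paper's notation; $z_t$ and $t$ are omitted as in the figure caption.

$$\hat{\epsilon}(c) = \epsilon_{\theta}(\varnothing) + s\,\bigl(\epsilon_{\theta}(c) - \epsilon_{\theta}(\varnothing)\bigr), \qquad s = 1 \;\Rightarrow\; \hat{\epsilon}(c) = \epsilon_{\theta}(c).$$

At $s = 1$ the unconditional term cancels and the guided prediction reduces to the plain conditional prediction, which is what "setting the guidance scale to 1" amounts to in this formulation.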

Theorems & Definitions (3)

  • Theorem 2.4
  • Theorem 2.1
  • proof