MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang

TL;DR

This paper introduces MegaFusion, a novel approach that extends existing diffusion-based text-to-image models towards efficient higher-resolution generation without additional fine-tuning or adaptation, and employs an innovative truncate and relay strategy to bridge the denoising processes across different resolutions.

Abstract

Diffusion models have emerged as frontrunners in text-to-image generation, but their fixed image resolution during training often leads to challenges in high-resolution image generation, such as semantic deviations and object replication. This paper introduces MegaFusion, a novel approach that extends existing diffusion-based text-to-image models towards efficient higher-resolution generation without additional fine-tuning or adaptation. Specifically, we employ an innovative truncate and relay strategy to bridge the denoising processes across different resolutions, allowing for high-resolution image generation in a coarse-to-fine manner. Moreover, by integrating dilated convolutions and noise re-scheduling, we further adapt the model's priors for higher resolution. The versatility and efficacy of MegaFusion make it universally applicable to both latent-space and pixel-space diffusion models, along with other derivative models. Extensive experiments confirm that MegaFusion significantly boosts the capability of existing models to produce images of megapixels and various aspect ratios, while only requiring about 40% of the original computational cost.
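The coarse-to-fine idea in the abstract can be made concrete with a back-of-the-envelope step plan: early denoising steps run at low resolution, and later steps are relayed to progressively higher resolutions. The following is a minimal sketch under stated assumptions, not the authors' implementation. The names `Stage`, `plan_stages`, and `relative_cost`, the three-stage split, and the assumption that per-step cost scales linearly with pixel count are all hypothetical, so the resulting ratio is illustrative and is not the paper's reported figure.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    height: int
    width: int
    steps: int  # number of denoising steps run at this resolution


def plan_stages(total_steps, resolutions, fractions):
    """Split `total_steps` denoising steps across increasing resolutions.

    `fractions` is a hypothetical per-stage share of the step budget;
    the source does not give the actual truncation points.
    """
    if len(resolutions) != len(fractions):
        raise ValueError("need one fraction per resolution")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    stages = []
    remaining = total_steps
    last = len(resolutions) - 1
    for i, ((h, w), frac) in enumerate(zip(resolutions, fractions)):
        # The final stage takes whatever is left so the steps sum exactly.
        steps = remaining if i == last else round(total_steps * frac)
        remaining -= steps
        stages.append(Stage(h, w, steps))
    return stages


def relative_cost(stages):
    """Cost relative to running every step at the final resolution.

    Assumption: per-step cost is proportional to the number of pixels.
    Attention layers typically scale worse than this, so treat the
    result as a rough estimate only.
    """
    final = stages[-1]
    total_steps = sum(s.steps for s in stages)
    full = total_steps * final.height * final.width
    staged = sum(s.steps * s.height * s.width for s in stages)
    return staged / full


stages = plan_stages(
    total_steps=50,
    resolutions=[(512, 512), (768, 768), (1024, 1024)],
    fractions=[0.4, 0.3, 0.3],
)
print([s.steps for s in stages])       # [20, 15, 15]
print(round(relative_cost(stages), 5))  # 0.56875
```

Under this toy split, the staged schedule costs a little over half of a full-resolution run. Shifting more of the step budget to the low-resolution stages lowers the ratio further, which is the intuition behind the efficiency claim.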

Paper Structure (35 sections, 6 equations, 21 figures, 7 tables)

Figures (21)

  • Figure 1: Overview. Left: Existing diffusion-based text-to-image models fall short in synthesizing higher-resolution images because the image resolution is fixed during training, resulting in a noticeable decline in image quality and semantic deviation. Right: Our proposed tuning-free MegaFusion can effectively and efficiently extend diffusion models (e.g., SDM, SDXL, and Floyd) towards generating images at higher resolutions (e.g., $1024 \times 1024$, $1920 \times 1080$, $2048 \times 1536$, and $2048 \times 2048$) of arbitrary aspect ratios (e.g., $1:1$, $16:9$, and $4:3$). We recommend zooming in to view the results.
  • Figure 2: Architecture Overview. (a) The Truncate and Relay strategy in MegaFusion seamlessly connects generation processes across different resolutions to produce higher-resolution images without extra tuning, exemplified by a three-stage pipeline. For pixel-space models, the VAE encoder and decoder can be directly removed. (b) Limited receptive fields lead to quality decline and object replication. Dilated convolutions expand the receptive field at higher resolutions, enabling the model to capture more global information for more accurate semantics and image details. (c) Noise at identical timesteps affects images of different resolutions differently, deviating from the model's prior. Noise re-scheduling helps align the noise level of higher-resolution images with that of the original resolution.
  • Figure 3: Qualitative results of applying our MegaFusion to both latent-space and pixel-space diffusion models for higher-resolution image generation on MS-COCO and commonly used prompts from the Internet. Our method can effectively extend existing diffusion-based models towards synthesizing higher-resolution images of megapixels with correct semantics and details.
  • Figure 4: Qualitative results of incorporating MegaFusion into models with extra conditional inputs. MegaFusion can be universally applied across various diffusion models, providing the capability for higher-resolution image generation with better semantics and fidelity.
  • Figure 5: Ablation study of classifier-free guidance (CFG) weight on SDM-MegaFusion and SDXL-MegaFusion.
  • ...and 16 more figures
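
The caption of Figure 2(b) attributes quality decline to limited receptive fields and notes that dilated convolutions expand them. The standard receptive-field recurrence for a stack of convolutions makes that effect easy to check. This is a minimal sketch: the function name `receptive_field` and the three-layer stack are hypothetical and do not describe the actual UNet layers used in the paper.

```python
def receptive_field(layers):
    """Receptive field, along one axis, of a stack of conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from input to output.
    """
    rf = 1    # a single output position initially sees one input pixel
    jump = 1  # input-pixel distance between adjacent feature-map positions
    for kernel, stride, dilation in layers:
        # Each layer widens the field by (kernel - 1) taps, spaced
        # `dilation` apart on a grid whose spacing is `jump` input pixels.
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical stack: three 3x3, stride-1 convolutions.
plain = [(3, 1, 1)] * 3
dilated = [(3, 1, 2)] * 3

print(receptive_field(plain))    # 7
print(receptive_field(dilated))  # 13
```

With the same weights and the same number of layers, a dilation of 2 nearly doubles the span of input pixels each output position can see, which is why dilation is a natural tuning-free way to restore global context when the input grid grows.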