ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation

Moayed Haji-Ali, Guha Balakrishnan, Vicente Ordonez

TL;DR

ElasticDiffusion is a training-free decoding strategy that lets a pretrained diffusion model generate images at arbitrary resolutions and aspect ratios by decoupling global content guidance from local pixel-level detail. It estimates the local unconditional score on patches augmented with surrounding context, and derives the global class-direction score from a downsampled reference latent, then upscales that score to the target size. An iterative resampling step that refines the global score and a Reduced-Resolution Guidance mechanism further stabilize the output. The method produces coherent images across diverse sizes on CelebA-HQ and LAION-COCO, with competitive FID and CLIP scores and a favorable memory footprint compared to SDXL. The authors acknowledge limitations at extreme resolutions and with complex prompts, and suggest broader applicability and further disentanglement of global and local signals as future work.

Abstract

Diffusion models have revolutionized image generation in recent years, yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion, a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. The local signal controls low-level pixel information and can be estimated on local patches, while the global signal is used to maintain overall structural consistency and is estimated with a reference image. We test our method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Our experiments and qualitative results show superior image coherence quality across aspect ratios compared to MultiDiffusion and the standard decoding strategy of Stable Diffusion. Project page: https://elasticdiffusion.github.io/
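
The local/global split described in the abstract can be read against the standard classifier-free guidance update. The following is a sketch under assumed notation: $\epsilon_\theta$ (the pretrained noise predictor), $c$ (the text condition), and $w$ (the guidance weight) are symbols introduced here for illustration, while $\Delta_c$ is the class-direction score named in the Figure 3 caption. The unconditional term plays the role of the local signal and $\Delta_c$ the role of the global signal.

```latex
\tilde{\epsilon}_\theta(x_t, c)
  = \underbrace{\epsilon_\theta(x_t)}_{\text{unconditional score (local)}}
  + \; w \, \underbrace{\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t)\bigr)}_{\Delta_c \;\text{(class-direction score, global)}}
```

Read this way, the two terms need not be evaluated on the same input: the first can plausibly be computed patch by patch on the full-size latent, and the second on a reduced-size reference latent whose result is then resized.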

Paper Structure

This paper contains 22 sections, 9 equations, 20 figures, 5 tables, and 1 algorithm.

Figures (20)

  • Figure 1: ElasticDiffusion generates high quality images at arbitrary sizes using a pretrained diffusion model trained on a single image size, with equivalent memory footprint and no further training. These results are based on $\text{Stable Diffusion}_{1.4}$, which was trained to generate $512 \times 512$ images. The examples shown in this collage are presented without any image cropping, stretching, or post-processing.
  • Figure 2: PCA of diffusion scores: class-direction score (top) dictates global content by clustering on semantic parts, while the unconditional score (bottom) lacks pixel correlations.
  • Figure 3: Illustration of ElasticDiffusion: We generate images at various sizes by generating local and global content separately. For local content, we partition the latent $\bar{x}_t$ into non-overlapping patches $p_k$, each concatenated with context $c_k$ to estimate their unconditional score. For global content, we downsample $\bar{x}_t$ to $\mathbf{x}_t$, pad to a square size ($\hat{\mathbf{x}}_t$), compute class-direction score ($\Delta_c$), and upscale to match $\bar{x}_t$.
  • Figure 4: Comparing strategies for calculating the diffusion model score on a local patch. No overlap between adjacent patches (A) leads to discontinuities at the boundaries. Strategies (B) and (C) explicitly overlap nearby patches and need substantial overlap to be effective. Our implicit overlapping method (D) achieves superior results with computational demand similar to (B).
  • Figure 5: The effect of Reduced-Resolution Guidance (RRG). Higher RRG weights effectively eliminate emerging artifacts, albeit at the cost of slightly blurrier outputs. $\delta = 200$ strikes a good balance. Improvements are more noticeable when zooming in.
  • ...and 15 more figures
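
The data flow in the Figure 3 caption can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: `uncond_score` and `class_direction_score` are hypothetical stand-ins for the two network evaluations, and the nearest-neighbour resize, top-left zero padding, 64-wide latent patches, and guidance weight of 7.5 are all assumptions. The patch context $c_k$, the resampling of the global score, and Reduced-Resolution Guidance are omitted.

```python
import numpy as np

REF = 64  # assumed latent side the pretrained model was trained on

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) array; works up or down."""
    _, h, w = x.shape
    rows = (np.arange(out_h) * h) // out_h
    cols = (np.arange(out_w) * w) // out_w
    return x[:, rows][:, :, cols]

def pad_to_square(x, side):
    """Zero-pad a (C, h, w) array to (C, side, side), anchored top-left."""
    c, h, w = x.shape
    out = np.zeros((c, side, side), dtype=x.dtype)
    out[:, :h, :w] = x
    return out

# Hypothetical stand-ins for the pretrained model's two score estimates.
def uncond_score(x):
    return 0.1 * x

def class_direction_score(x):
    return np.tanh(x)

def local_score(latent, patch=REF):
    """Unconditional score, estimated on non-overlapping patches p_k."""
    _, h, w = latent.shape
    out = np.empty_like(latent)
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tile = latent[:, top:top + patch, left:left + patch]
            out[:, top:top + patch, left:left + patch] = uncond_score(tile)
    return out

def global_score(latent, ref=REF):
    """Class-direction score: downsample, pad to square, score, crop, upscale."""
    _, h, w = latent.shape
    scale = ref / max(h, w)
    small_h = max(1, round(h * scale))
    small_w = max(1, round(w * scale))
    small = resize_nearest(latent, small_h, small_w)   # downsampled latent
    padded = pad_to_square(small, ref)                 # square reference latent
    delta = class_direction_score(padded)              # Delta_c at reference size
    delta = delta[:, :small_h, :small_w]               # drop the padded region
    return resize_nearest(delta, h, w)                 # match the full latent

def guided_score(latent, guidance_weight=7.5):
    """Combine the local and global signals as in classifier-free guidance."""
    return local_score(latent) + guidance_weight * global_score(latent)

rng = np.random.default_rng(0)
x_bar = rng.standard_normal((4, 64, 128))  # a 1:2 aspect-ratio latent
eps = guided_score(x_bar)
print(eps.shape)
```

With a real model, each patch would have to match the size the network expects, which is why the sketch uses a patch side equal to `REF`; edge tiles of a different size would need the context or padding handling that is left out here.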