Is One GPU Enough? Pushing Image Generation at Higher-Resolutions with Foundation Models

Athanasios Tragakis; Marco Aversa; Chaitanya Kaul; Roderick Murray-Smith; Daniele Faccio

Abstract

In this work, we introduce Pixelsmith, a zero-shot text-to-image generative framework to sample images at higher resolutions with a single GPU. We are the first to show that it is possible to scale the output of a pre-trained diffusion model by a factor of 1000, opening the road for gigapixel image generation at no additional cost. Our cascading method uses the image generated at the lowest resolution as a baseline to sample at higher resolutions. For the guidance, we introduce the Slider, a tunable mechanism that fuses the overall structure contained in the first-generated image with enhanced fine details. At each inference step, we denoise patches rather than the entire latent space, minimizing memory demands such that a single GPU can handle the process, regardless of the image's resolution. Our experimental results show that Pixelsmith not only achieves higher quality and diversity compared to existing techniques, but also reduces sampling time and artifacts. The code for our work is available at https://github.com/Thanos-DB/Pixelsmith.
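The patch-wise denoising mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `toy_denoise` is a hypothetical stand-in for the diffusion model, the random patch selection is an assumption, and the overlap rule (revert already-denoised pixels before the model call, restore them afterwards) follows the description in the Figure 2 caption. The point is that the model only ever receives a patch-sized array, so its memory footprint does not grow with the size of the full latent.

```python
import numpy as np

def toy_denoise(patch, t):
    # Hypothetical stand-in for one denoising step of a diffusion model.
    return patch * 0.9

def denoise_step_in_patches(latent, t, patch=4, rng=None):
    """Advance every pixel of a (C, H, W) latent by one timestep, patch by patch."""
    if rng is None:
        rng = np.random.default_rng(0)
    latent = latent.copy()
    _, h, w = latent.shape
    previous = latent.copy()              # values before this timestep
    done = np.zeros((h, w), dtype=bool)   # pixels already denoised at this timestep
    calls = 0
    while not done.all():
        # Pick a random patch that covers at least one pending pixel.
        ys, xs = np.nonzero(~done)
        k = rng.integers(len(ys))
        y0 = min(max(ys[k] - rng.integers(patch), 0), h - patch)
        x0 = min(max(xs[k] - rng.integers(patch), 0), w - patch)
        sl = (slice(None), slice(y0, y0 + patch), slice(x0, x0 + patch))
        mask = done[y0:y0 + patch, x0:x0 + patch]
        current = latent[sl].copy()
        # Revert already-denoised pixels so the whole patch sits at the same timestep.
        model_in = np.where(mask, previous[sl], current)
        out = toy_denoise(model_in, t)    # the model sees only a (C, patch, patch) array
        # Restore pixels that were already denoised; keep new values elsewhere.
        latent[sl] = np.where(mask, current, out)
        done[y0:y0 + patch, x0:x0 + patch] = True
        calls += 1
    return latent, calls

x = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
stepped, n_calls = denoise_step_in_patches(x, t=10, patch=4)
```

In this sketch every pixel is updated exactly once per timestep regardless of how patches overlap, so the result matches what a single full-latent call to `toy_denoise` would give.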

Paper Structure (39 sections, 3 equations, 17 figures, 3 tables)

Introduction
Related Work
Trained Models
Adapted Models
Foundations
Diffusion models
Patch sampling
Method
Problem statement
Framework overview
Text-to-image generation
Upsampling process
Image guidance preparation
Image generation
Higher generation in one step
...and 24 more sections

Figures (17)

Figure 1: Examples of images generated using Pixelsmith. The proposed framework generates images at higher resolutions than the pre-trained model without any fine-tuning. Images at different resolutions are shown with cut-out areas for both Pixelsmith and the base model. The higher-resolution images are shown to scale with the images generated by the base model; only the lower-resolution version of the gigapixel image has been resized for better visualisation. Some cut-outs of the gigapixel generation have a resolution close to that of the base model, $1024^2$, and are comparable in aesthetics to the base model's images, showing that our framework is capable of true gigapixel generation (zoom in for better detail).
Figure 2: Overview of the patch denoising process proposed by DiffInfinite: The top row represents the latent space, while the bottom row tracks the timesteps for each pixel. Each pixel should be denoised only once per timestep, so when overlapping occurs, already denoised pixels revert to their previous values from the prior timestep. After denoising, these reverted pixels are restored to their original denoised state from the current timestep.
Figure 3: Proposed framework overview. 1. Text-to-image generation: A pre-trained text-to-image diffusion model generates an initial image based on the input text prompt. 2. Upsampling process: The generated image is upscaled (in this use case by a factor of $\times 4$) and encoded into the latent space to guide the creation of a higher-resolution image. 3. Image guidance preparation: The encoded image is degraded through the diffusive forward model, creating the guidance latents. 4. Image generation: The Slider (indicated by a blue line) adjusts the extent of guidance. Left of Slider (Guided Generation): guidance latents control the image generation. The framework fuses guidance latents (green patches) with high-resolution latents (purple patches) using the Fast Fourier Transform (FFT). The phases are averaged and combined with the amplitude, then transformed back via the inverse FFT (iFFT). A chess-like mask integrates information from the successive guidance step (orange), resulting in fully processed patches (cyan). Right of Slider (Pure Generation): the generation relies only on the prompt. Higher-Resolution Comparison: while the base model upscales the bust with disfigured hands, the proposed method enhances details, corrects distortions, and prevents new artifacts.
Figure 4: Masking effects on higher-resolution generation ($\times 16$ the original resolution). (left) Image generated using SDXL. (center) Image generated with Pixelsmith with masking. (right) Image generated with Pixelsmith without masking. We highlighted the artifacts introduced by generating at higher scales. These artifacts demonstrate the challenges of maintaining coherence and accuracy when scaling up the resolution without additional guidance.
Figure 5: Qualitative comparisons: This figure highlights how other models suffer from duplications (red arrows) and introduce artifacts in areas with complex, high-frequency patterns (purple arrows). In contrast, Pixelsmith effectively eliminates these issues (zoom in for better detail).
...and 12 more figures
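The Fourier-domain fusion and chess-like masking described in the Figure 3 caption can be sketched as follows. This is an illustrative reading rather than the authors' code: the caption does not say which amplitude is kept, so the sketch assumes the amplitude of the high-resolution (generated) patch; the phase average is taken as a plain arithmetic mean of the two phase angles; and the function names `fuse_phase`, `chess_mask`, and `blend_with_mask` are hypothetical.

```python
import numpy as np

def fuse_phase(guidance, generated):
    """Fuse two real-valued latent patches of equal shape in the Fourier domain."""
    G = np.fft.fft2(guidance, axes=(-2, -1))
    X = np.fft.fft2(generated, axes=(-2, -1))
    # Literal reading of "the phases are averaged": arithmetic mean of the angles.
    phase = 0.5 * (np.angle(G) + np.angle(X))
    # Assumption: keep the amplitude of the generated (high-resolution) patch.
    amplitude = np.abs(X)
    fused = amplitude * np.exp(1j * phase)
    # Back to the spatial domain; any imaginary residue is discarded.
    return np.fft.ifft2(fused, axes=(-2, -1)).real

def chess_mask(h, w, cell=1):
    """Boolean checkerboard of shape (h, w) with square cells of side `cell`."""
    yy, xx = np.indices((h, w))
    return ((yy // cell + xx // cell) % 2).astype(bool)

def blend_with_mask(fused, guidance_next, mask):
    """Take guidance values where the mask is True and fused values elsewhere."""
    return np.where(mask, guidance_next, fused)

rng = np.random.default_rng(0)
guidance_patch = rng.standard_normal((4, 8, 8))
generated_patch = rng.standard_normal((4, 8, 8))
fused_patch = fuse_phase(guidance_patch, generated_patch)
mixed_patch = blend_with_mask(fused_patch, guidance_patch, chess_mask(8, 8))
```

A useful sanity check on this reading is that fusing a patch with itself returns the patch unchanged, since both the averaged phase and the kept amplitude then equal those of the input.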
