High-Resolution Image Editing via Multi-Stage Blended Diffusion

Johannes Ackermann, Minjun Li

TL;DR

The paper tackles high-resolution, text-guided image editing with diffusion models by proposing a multi-stage pipeline: it edits at low resolution with Blended Diffusion, then progressively upscales, applying a super-resolution model followed by Blended Diffusion at each stage. It uses RePaint and CLIP-based reranking to select consistent edits, low-pass filters the background to stabilize diffusion, and handles memory constraints by tiling and alpha compositing. Compared with baselines such as naive high-resolution Blended Diffusion and DALL-E 2 editing in segments, the approach achieves higher global coherence and visual fidelity on megapixel outputs. The work contributes a practical, scalable framework for megapixel editing with pre-trained diffusion models, and provides ablations, guidance on parameter choices, a discussion of limitations, and ethical considerations.

Abstract

Diffusion models have shown great results in image generation and in image editing. However, current approaches are limited to low resolutions due to the computational cost of training diffusion models for high-resolution generation. We propose an approach that uses a pre-trained low-resolution diffusion model to edit images in the megapixel range. We first use Blended Diffusion to edit the image at a low resolution, and then upscale it in multiple stages, using a super-resolution model and Blended Diffusion. Using our approach, we achieve higher visual fidelity than by applying only off-the-shelf super-resolution methods to the output of the diffusion model. We also obtain better global consistency than by directly using the diffusion model at a higher resolution.
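The control flow described in the abstract can be sketched as below. This is a minimal illustration and not the authors' implementation: `edit_fn`, `upscale_fn`, and `refine_fn` are hypothetical callables standing in for the low-resolution Blended Diffusion edit, the super-resolution model, and Blended Diffusion started from an intermediate step. The nearest-neighbor mask resizing and the `blend` helper are likewise assumptions made for the example.

```python
import numpy as np

def blend(edited, background, mask):
    """Keep edited pixels where mask == 1 and background pixels where mask == 0."""
    m = mask[..., None]  # broadcast the H x W mask over the channel axis
    return m * edited + (1.0 - m) * background

def resize_mask(mask, new_h, new_w):
    """Nearest-neighbor resize of a 2-D mask (stand-in for a proper resampler)."""
    old_h, old_w = mask.shape
    rows = (np.arange(new_h) * old_h) // new_h
    cols = (np.arange(new_w) * old_w) // new_w
    return mask[np.ix_(rows, cols)]

def multi_stage_edit(image, mask, prompt, edit_fn, upscale_fn, refine_fn, num_stages):
    """Edit once at low resolution, then alternate upscaling and refinement.

    Hypothetical signatures (assumptions, not from the paper):
      edit_fn(image, mask, prompt)   -> image of the same size
      upscale_fn(image)              -> larger image
      refine_fn(image, mask, prompt) -> image of the same size
    """
    # First stage: text-guided edit of the masked region at low resolution.
    image = edit_fn(image, mask, prompt)
    for _ in range(num_stages):
        # Following stages: super-resolve, then refine the masked region again.
        image = upscale_fn(image)
        mask = resize_mask(mask, image.shape[0], image.shape[1])
        image = refine_fn(image, mask, prompt)
    return image
```

With trivial stand-ins (for example, `upscale_fn=lambda im: im.repeat(2, 0).repeat(2, 1)` and an identity `refine_fn`), two stages turn an 8x8 input into a 32x32 output, which is enough to check the plumbing before real models are plugged in.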

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Our approach performs high-resolution text-guided image editing in multiple stages. In the first stage a), we apply Blended Diffusion (Avrahami et al., 2022), given a masked region and a text prompt. In each following stage b), we first upscale the image using an off-the-shelf super-resolution model and then use Blended Diffusion, starting at an intermediate diffusion step, to improve the image quality and ensure consistency with the input prompt. c) When the output resolution of a stage is too large to fit into the GPU memory, we split the image into multiple segments, apply upscaling and Blended Diffusion to them separately, and alpha-composite the results.
  • Figure 2: Comparison of our approach to two baselines for the prompt “Statue of Roman Emperor, Canon 5D Mark 3, 35mm, flickr”. From left to right: Blended Diffusion applied to 960x960 pixels followed by bilinear upscaling, DALL-E 2 editing in multiple segments, our proposed approach. The size of the mask is 1166x2297 pixels. Applying Blended Diffusion directly to the higher resolution leads to incoherent generation with repeated elements. Similarly, DALL-E 2 generates two statues, with one floating above the other. Our approach is able to generate a detailed, coherent image.
  • Figure 3: Our approach, with all intermediate results being shown. Best viewed zoomed in.
  • Figure 4: Comparison of our approach to two baselines. Left: We directly apply Blended Diffusion to the highest resolution we can fit into the VRAM (960x960 pixels) and then bilinearly upscale the output. Middle: We use the DALL-E 2 web UI to edit the image at its full resolution. Because the edited region is larger than the 1024x1024 generation window, we have to apply DALL-E 2 to multiple independent segments. Right: Our proposed approach. We find that directly applying Blended Diffusion leads to repeated elements (two heads, two mountains, two pictures) and fails to produce fine details (hair). DALL-E 2 produces images with high visual fidelity, but they are not globally coherent (floating statues, four paintings). Our method produces globally consistent images while providing a similar visual fidelity. Note that the full images are downscaled. The zoomed-in regions measure 512x512 pixels and are shown at full resolution.
  • Figure 5: Ablation of different upscaling methods, applied after the Blended Diffusion in the first stage with fixed seeds. a) Bilinear upscaling, b) ESRGAN, c) ESRGAN + unconditional diffusion, d) ESRGAN + text-conditional diffusion, e) ESRGAN + text-conditional Blended Diffusion, f) ESRGAN + text-conditional Blended Diffusion with a low-pass filtered background. Note that the images are downscaled from the full resolution.
  • ...and 2 more figures
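
The segment-wise processing in Figure 1 c) can be illustrated with a short NumPy sketch. It is a hedged example under stated assumptions, not the paper's code: the tile size, the overlap, the linear ramp weights, and the helper names (`tile_starts`, `ramp_weights`, `process_in_tiles`) are all hypothetical, and `process_fn` is assumed to return an output of the same size as its input tile (for instance, Blended Diffusion applied to an already upscaled segment).

```python
import numpy as np

def tile_starts(length, tile, step):
    """Start offsets so that tiles of size `tile` cover [0, length)."""
    if length <= tile:
        return [0]
    return list(range(0, length - tile, step)) + [length - tile]

def ramp_weights(size, overlap):
    """1-D alpha weights that rise and fall linearly over the overlap region."""
    overlap = min(overlap, size // 2)
    w = np.ones(size)
    if overlap > 0:
        ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]  # strictly positive
        w[:overlap] = ramp
        w[size - overlap:] = ramp[::-1]
    return w

def process_in_tiles(image, tile, overlap, process_fn):
    """Apply process_fn to overlapping tiles and alpha-composite the outputs."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    h, w = image.shape[:2]
    acc = np.zeros(image.shape, dtype=np.float64)
    weight_sum = np.zeros((h, w, 1), dtype=np.float64)
    step = tile - overlap
    for y in tile_starts(h, tile, step):
        for x in tile_starts(w, tile, step):
            patch = image[y:y + tile, x:x + tile]
            out = process_fn(patch)  # assumed to keep the patch size
            ph, pw = patch.shape[:2]
            alpha = (ramp_weights(ph, overlap)[:, None]
                     * ramp_weights(pw, overlap)[None, :])[..., None]
            acc[y:y + ph, x:x + pw] += out * alpha
            weight_sum[y:y + ph, x:x + pw] += alpha
    # Normalize so overlapping regions are a weighted average of their tiles.
    return acc / weight_sum
```

Normalizing by the accumulated weights means that an identity `process_fn` reproduces the input, while tiles that disagree in their overlap are cross-faded rather than joined along a hard seam.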