High-Resolution Image Editing via Multi-Stage Blended Diffusion
Johannes Ackermann, Minjun Li
TL;DR
The paper tackles high-resolution, text-guided image editing with diffusion models. It proposes a multi-stage pipeline that first edits the image at low resolution with Blended Diffusion and then progressively upscales the result with super-resolution and Blended Diffusion. It uses RePaint and CLIP-based reranking to select consistent edits, low-pass filters the background to stabilize diffusion, and handles memory constraints by tiling and alpha compositing. Compared with baselines such as naive high-resolution Blended Diffusion and segment-wise editing with DALL-E 2, the approach achieves higher global coherence and visual fidelity on megapixel outputs. The work contributes a practical, scalable framework for megapixel editing with pre-trained diffusion models, and it provides ablations, guidance on parameter choices, a discussion of limitations, and ethical considerations.
Abstract
Diffusion models have shown great results in image generation and in image editing. However, current approaches are limited to low resolutions due to the computational cost of training diffusion models for high-resolution generation. We propose an approach that uses a pre-trained low-resolution diffusion model to edit images in the megapixel range. We first use Blended Diffusion to edit the image at a low resolution, and then upscale it in multiple stages, using a super-resolution model and Blended Diffusion. Using our approach, we achieve higher visual fidelity than by only applying off-the-shelf super-resolution methods to the output of the diffusion model. We also obtain better global consistency than by directly using the diffusion model at a higher resolution.
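The tiling and alpha compositing mentioned in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: the tile size, the overlap, the linear feathering ramp, and the names `tile_starts`, `feather_weights`, `process_tiled`, and `edit_fn` are all assumptions made for illustration. Here `edit_fn` is a hypothetical stand-in for whatever per-tile operation does not fit in memory at full resolution (for example, one Blended Diffusion pass on a crop).

```python
import numpy as np


def tile_starts(size, tile, overlap):
    """Start offsets of tiles of length `tile` that cover [0, size).

    Neighbouring tiles share at least `overlap` pixels; the last tile is
    shifted back so that it ends exactly at the image border.
    """
    if size <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)
    return starts


def feather_weights(length, overlap):
    """1-D alpha weights: linear ramps at both ends, 1.0 in the middle.

    All weights are strictly positive, so every pixel receives a
    non-zero total weight after accumulation.
    """
    overlap = min(overlap, length // 2)
    w = np.ones(length, dtype=np.float64)
    if overlap > 0:
        ramp = np.linspace(0.0, 1.0, overlap + 2)[1:-1]
        w[:overlap] = ramp
        w[length - overlap:] = ramp[::-1]
    return w


def process_tiled(image, edit_fn, tile=64, overlap=16):
    """Apply `edit_fn` tile by tile and alpha-composite the results.

    `image` has shape (H, W) or (H, W, C). `edit_fn` (hypothetical) must
    return an array with the same shape as the tile it receives.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    h, w = image.shape[:2]
    th, tw = min(tile, h), min(tile, w)

    # 2-D alpha mask for one tile, broadcastable over optional channels.
    alpha = np.outer(feather_weights(th, overlap), feather_weights(tw, overlap))
    alpha = alpha.reshape(alpha.shape + (1,) * (image.ndim - 2))

    acc = np.zeros(image.shape, dtype=np.float64)
    norm = np.zeros((h, w) + (1,) * (image.ndim - 2), dtype=np.float64)

    for y in tile_starts(h, th, overlap):
        for x in tile_starts(w, tw, overlap):
            edited = edit_fn(image[y:y + th, x:x + tw])
            acc[y:y + th, x:x + tw] += alpha * edited
            norm[y:y + th, x:x + tw] += alpha

    # Normalise by the accumulated alpha so overlapping tiles are averaged.
    return acc / norm
```

In this sketch only one tile is passed to `edit_fn` at a time, so peak memory depends on the tile size rather than on the full image, and the feathered weights are one simple way to avoid visible seams where tiles overlap.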
