Lazy Diffusion Transformer for Interactive Image Editing

Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi

TL;DR

The paper addresses interactive image editing with diffusion models by restricting computation to the masked region rather than the full image. It introduces LazyDiffusion, a two-stage architecture: a global context encoder compresses the entire canvas into a small set of tokens, and a diffusion transformer decoder denoises only the masked region, conditioned on this context and a text prompt. Runtime therefore scales with the mask size, giving roughly a $10\times$ speedup for typical masks covering $10\%$ of the image while maintaining quality comparable to full-image inpainting. Experiments show competitive quality (FID/CLIP) and strong user preference over crop-based baselines, with practical benefits for interactive, multi-step edits and support for sketch-guided conditioning.

Abstract

We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.
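
The claim that decoder runtime scales with mask size can be illustrated by counting how many patch tokens a mask touches. The sketch below is only an illustration: the helper names `hole_patch_indices` and `token_fraction`, the 16×16 toy mask, and the patch size of 4 are assumptions made for this example, not values or code from the paper, and the token fraction is a rough proxy for cost rather than a measured runtime.

```python
def hole_patch_indices(mask, patch):
    """Return flat (row-major) indices of patches with at least one masked pixel.

    `mask` is a list of rows holding 0/1 values; `patch` is the patch side length.
    Both the representation and the name are hypothetical, chosen for illustration.
    """
    height, width = len(mask), len(mask[0])
    if height % patch or width % patch:
        raise ValueError("mask dimensions must be divisible by the patch size")
    cols = width // patch
    indices = []
    for py in range(height // patch):
        for px in range(cols):
            covered = any(
                mask[y][x]
                for y in range(py * patch, (py + 1) * patch)
                for x in range(px * patch, (px + 1) * patch)
            )
            if covered:
                indices.append(py * cols + px)
    return indices


def token_fraction(mask, patch):
    """Fraction of all patch tokens that a lazy decoder would have to generate."""
    total = (len(mask) // patch) * (len(mask[0]) // patch)
    return len(hole_patch_indices(mask, patch)) / total


# Toy 16x16 canvas with a 4x8 rectangular hole (rows 4-7, columns 4-11).
mask = [[1 if 4 <= y < 8 and 4 <= x < 12 else 0 for x in range(16)]
        for y in range(16)]

print(hole_patch_indices(mask, 4))  # patches touched by the hole
print(token_fraction(mask, 4))      # share of tokens the decoder would process
```

In this toy case the hole covers 2 of 16 patches, so a decoder that only processes hole tokens would handle one eighth of the tokens a full-image model maintains.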

Paper Structure

This paper contains 25 sections, 6 equations, 14 figures, and 2 tables.

Figures (14)

  • Figure 1: Incremental image generation at $1024\times1024$ using LazyDiffusion with 20 diffusion steps. The model generates content according to a text prompt in an area specified by a mask. Each update generates only the masked pixels, with a runtime that depends chiefly on the size of the mask, rather than that of the image.
  • Figure 2: Comparing inpainting approaches. (a) Most works [rombach2022high, podell2023sdxl] generate the entire image, using the full image context, and fill the hole by discarding the non-masked regions. The outcome aligns well with the image, but the process is time-consuming. (b) Generating only a lower-resolution crop around the mask [sd-webui, diffusers] is more efficient and still blends seamlessly with nearby pixels. However, the inpainted content is semantically inconsistent with the overall image context. (c) Our approach ensures both global consistency and efficient execution.
  • Figure 3: Our diffusion transformer decoder (bottom) reduces synthesis computation using two strategies. First, we compress the image context using a separate encoder (not shown) outside the diffusion loop. Second, we only generate tokens corresponding to the masked region. In contrast, typical diffusion transformers (top) [peebles2023scalable, chen2023pixartalpha] maintain tokens for the entire image throughout the diffusion process to preserve global context. When performing inpainting, such a model generates a full-size image, most of which is discarded, in order to fill in the hole region only. Existing convolutional diffusion models for inpainting [rombach2022high] suffer from the same drawbacks.
  • Figure 4: Overview. To generate an incremental image update, our algorithm takes as input a user mask and a text prompt. (top) We start by transforming the visible pixels and binary mask into patches, and pass them to a vision transformer (ViT) encoder. We then drop all tokens, except those corresponding to the hole region; this is our global context. (bottom) To generate the missing pixels, we initialize a set of noise patches corresponding to the masked region and pass them through a diffusion transformer model for several denoising iterations, until we obtain denoised patches. Unlike previous works [peebles2023scalable, chen2023pixartalpha], which process the entire image, our diffusion transformer only processes the patches required to cover the missing region. We train our encoder and diffusion decoder jointly using a diffusion denoising objective on the missing patches. The generated patches are then blended back into the missing region to produce the final output. Our model operates in a pretrained latent image space [rombach2022high], but we illustrate our pipeline with RGB images for simplicity.
  • Figure 5: Comparing LazyDiffusion's runtime to that of baselines regenerating the entire $1024\times1024$ image or a smaller $512\times512$ crop around the mask. LazyDiffusion is consistently faster than RegenerateImage, especially for the small mask ratios typical of interactive edits, reaching a speedup of $10\times$. Similarly, LazyDiffusion is faster than RegenerateCrop for mask ratios $<25\%$. For larger masks (dashed), RegenerateCrop is technically faster, but it generates at low resolution and naively upsamples to the desired resolution, harming image quality.
  • ...and 9 more figures
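
The data flow described in the Figure 4 caption (encode the whole canvas once, keep only hole tokens as context, denoise only the hole patches, blend them back) can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: `lazy_update`, `toy_encode`, and `toy_denoise_step` are hypothetical names, patches are single floats instead of latent tensors, the text prompt is omitted, and the toy encoder and denoiser are simple stand-ins for the trained ViT encoder and diffusion transformer.

```python
import random


def lazy_update(canvas_patches, hole, encode, denoise_step, steps=20, seed=0):
    """Regenerate only the patches listed in `hole`; leave the rest untouched."""
    rng = random.Random(seed)
    # The encoder sees the whole canvas once, outside the denoising loop.
    context_all = encode(canvas_patches, hole)
    # Keep only the tokens at hole positions as the compact global context.
    context = [context_all[i] for i in hole]
    # Start from noise for the hole patches only.
    x = [rng.gauss(0.0, 1.0) for _ in hole]
    for t in reversed(range(steps)):
        x = denoise_step(x, context, t)
    # Blend the generated patches back into the canvas.
    out = list(canvas_patches)
    for index, value in zip(hole, x):
        out[index] = value
    return out


def toy_encode(patches, hole):
    """Stand-in encoder: every token carries the mean of the visible patches."""
    hole_set = set(hole)
    visible = [p for i, p in enumerate(patches) if i not in hole_set]
    mean = sum(visible) / len(visible)
    return [mean] * len(patches)


def toy_denoise_step(x, context, t):
    """Stand-in denoiser: move each patch toward its context token."""
    return [xi + (ci - xi) / (t + 1) for xi, ci in zip(x, context)]


canvas = [1.0, 2.0, 3.0, 4.0]
result = lazy_update(canvas, [1, 2], toy_encode, toy_denoise_step)
print(result)
```

The loop touches `len(hole)` patches per step regardless of canvas size, which is the property the caption attributes to the decoder; the visible patches at indices 0 and 3 pass through unchanged.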