Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference

Zihao Yu, Haoyang Li, Fangcheng Fu, Xupeng Miao, Bin Cui

TL;DR

Diffusion-based text-to-image editing is powerful but computationally expensive when users iteratively refine prompts. The authors present FISEdit, a cache-enabled sparse inference framework that automatically detects the regions affected by a prompt edit via Target Area Capture and recomputes only those regions, reusing cached activations from the previous generation elsewhere. The sparse computation is supported by Adaptive Pixel-Wise Sparse Convolution, Approximate Normalization, and Approximate Attention. The approach also includes a cache-based editing pipeline to manage data movement and a fine-grained mask generation strategy, and it preserves or improves edit fidelity. Evaluations on LAION-Aesthetics with Stable Diffusion show up to a $4.9\times$ reduction in MACs, with speedups of $3.4\times$ on TITAN RTX and $4.4\times$ on A100. These results suggest that interactive T2I editing services can reuse prior computations and focus resources on the edited regions.

Abstract

Due to the recent success of diffusion models, text-to-image generation is becoming increasingly popular and has found a wide range of applications. Among them, text-to-image editing, or continuous text-to-image generation, attracts considerable attention and can potentially improve the quality of generated images. Users commonly want to slightly edit a generated image by making minor modifications to their input textual descriptions over several rounds of diffusion inference. However, such an image editing process suffers from the low inference efficiency of many existing diffusion models, even with GPU accelerators. To solve this problem, we introduce Fast Image Semantically Edit (FISEdit), a cache-enabled sparse diffusion model inference engine for efficient text-to-image editing. The key intuition behind our approach is to exploit the semantic mapping between the minor modifications to the input text and the affected regions of the output image. For each text editing step, FISEdit automatically identifies the affected image regions and uses the cached feature maps of the unchanged regions to accelerate inference. Extensive empirical results show that FISEdit can be $3.4\times$ and $4.4\times$ faster than existing methods on NVIDIA TITAN RTX and A100 GPUs respectively, and even generates more satisfactory images.
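The idea of reusing cached feature maps for unchanged regions can be illustrated with a minimal NumPy sketch. It is not FISEdit's implementation: the function names `pointwise_layer` and `sparse_update`, the tensor shapes, and the choice of a 1x1 (per-pixel) layer are assumptions made so that the masked recomputation is exact and easy to check against a dense pass.

```python
import numpy as np

def pointwise_layer(x, weight):
    """Dense pass of a 1x1 convolution: each pixel is transformed independently.

    x: (H, W, C_in) feature map, weight: (C_in, C_out).
    """
    return x @ weight

def sparse_update(x_new, cached_out, mask, weight):
    """Recompute the layer only where mask is True; reuse the cache elsewhere.

    mask: (H, W) boolean array marking pixels affected by the edit (hypothetical).
    """
    out = cached_out.copy()
    out[mask] = x_new[mask] @ weight  # only the masked pixels are recomputed
    return out

rng = np.random.default_rng(0)
H, W, C_in, C_out = 8, 8, 4, 6
weight = rng.standard_normal((C_in, C_out))

# Previous generation: run densely once and cache the layer output.
x_old = rng.standard_normal((H, W, C_in))
cached_out = pointwise_layer(x_old, weight)

# New generation: the input differs only inside a small edited region.
mask = np.zeros((H, W), dtype=bool)
mask[2:5, 3:6] = True
x_new = x_old.copy()
x_new[mask] += rng.standard_normal((int(mask.sum()), C_in))

sparse_out = sparse_update(x_new, cached_out, mask, weight)
dense_out = pointwise_layer(x_new, weight)
recomputed_fraction = mask.mean()  # share of pixels that needed new computation
```

For a per-pixel layer the sparse result matches the dense one while touching only the masked pixels. Layers with spatial or global dependencies (3x3 convolutions, normalization, attention) do not have this property, which is presumably why the paper pairs caching with dedicated sparse and approximate operators.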

Paper Structure (26 sections, 2 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: A real example of the user's interaction with FISEdit and existing methods.
  • Figure 1: In real-world scenarios, users may want to provide masks and selectively modify only the regions within the mask that they consider unsatisfactory. In the example shown, the user specifies different masks and edits the prompt from "Mountaineering Wallpapers" to "Mountaineering Wallpapers under fireworks".
  • Figure 2: Overview of the structure of FISEdit. When a query arrives, the system first executes $k$ denoising steps and then generates a difference mask from the output latents of those $k$ steps. In the remaining denoising steps, the pre-computed results (activations and parameters) of each layer in the U-Net are reused according to the mask, and the feature maps are computed sparsely. In contrast to existing frameworks, which rely on batched inputs, we collect and cache the results of the previous generation to avoid redundant computation, and use the mask to both control and accelerate the new T2I generation process at the U-Net level.
  • Figure 2: Visualization of masks generated with (lower) and without (upper) cross-attention control. For both, the corresponding prompts are “A dog is sitting on the sofa” and “A dog is sitting on the sofa with a hat on its head”.
  • Figure 3: Variation of latent difference with iteration steps.
  • ...and 4 more figures
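
The mask-generation step described in the Figure 2 overview caption (run $k$ denoising steps, then derive a difference mask from the resulting latents) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the function `difference_mask`, the channel-averaged absolute difference, the `threshold` value, and the synthetic latents of shape (4, 64, 64) are all hypothetical choices.

```python
import numpy as np

def difference_mask(cached_latent, new_latent, threshold=0.5):
    """Mark spatial positions whose latents changed noticeably after k steps.

    cached_latent, new_latent: (C, H, W) latents from the previous and the
    new generation at the same denoising step. The threshold is an assumed
    hyperparameter, not a value taken from the paper.
    """
    diff = np.abs(new_latent - cached_latent).mean(axis=0)  # (H, W)
    return diff > threshold

rng = np.random.default_rng(0)

# Synthetic stand-ins for the latents after k denoising steps.
cached_latent = rng.standard_normal((4, 64, 64))
new_latent = cached_latent + 0.01 * rng.standard_normal((4, 64, 64))
new_latent[:, 10:20, 30:40] += 2.0  # simulate a region affected by the edit

mask = difference_mask(cached_latent, new_latent)
affected_fraction = mask.mean()  # share of the latent grid to recompute
```

In the remaining denoising steps, a mask of this kind would decide which positions are recomputed and which are read from the cache.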