Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks

Siyu Zou; Jiji Tang; Yiyi Zhou; Jing He; Chaoyi Zhao; Rongsheng Zhang; Zhipeng Hu; Xiaoshuai Sun

TL;DR

This paper tackles efficient semantic image editing with diffusion models by generating target masks instantly from cross-attention during denoising. It introduces InstDiffEdit, a training-free method that refines attention-based masks and uses them to guide the diffusion updates, followed by a final inpainting step to improve global consistency. A new benchmark, Editing-Mask, is proposed to assess mask accuracy and local editing ability. Experiments on ImageNet and Imagen show 5–6x faster inference and better editing quality than DiffEdit. The approach is plug-and-play for latent diffusion models and makes diffusion-based image editing more practical by combining fast mask generation, training-free refinement, and mask-guided editing with inpainting.

Abstract

Diffusion-based Image Editing (DIE) is an emerging research hot-spot, which often applies a semantic mask to control the target area for diffusion-based editing. However, most existing solutions obtain these masks via manual operations or off-line processing, greatly reducing their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In particular, InstDiffEdit employs the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of attention maps and achieve full automation, we equip InstDiffEdit with a training-free refinement scheme that adaptively aggregates the attention distributions for automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen, and compare it with a range of SOTA methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., 5 to 6 times faster.
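
The idea of mask guidance during the diffusion steps can be illustrated with a small sketch. This is not the paper's implementation: the function names `attention_to_mask` and `blend_latents`, the min-max normalisation, and the fixed `threshold` are assumptions made for illustration. The blending rule (edited content inside the mask, source content outside) is the generic mask-guided update used by masked diffusion editors.

```python
import numpy as np

def attention_to_mask(attn, threshold=0.5):
    """Binarise a 2D attention map (hypothetical helper).

    The map is min-max normalised to [0, 1]; pixels at or above
    `threshold` are marked as editable (1.0), the rest as kept (0.0).
    """
    attn = np.asarray(attn, dtype=np.float64)
    span = attn.max() - attn.min()
    if span == 0:
        # A flat map carries no localisation signal: edit nothing.
        return np.zeros_like(attn)
    norm = (attn - attn.min()) / span
    return (norm >= threshold).astype(np.float64)

def blend_latents(z_edit, z_source, mask):
    """Keep edited latents inside the mask and source latents outside.

    `mask` has shape (H, W) and broadcasts over the channel axis of
    latents shaped (C, H, W).
    """
    return mask * z_edit + (1.0 - mask) * z_source

# Toy example: a 2x2 attention map and 3-channel latents.
attn = [[0.0, 1.0], [2.0, 4.0]]
mask = attention_to_mask(attn)
z = blend_latents(np.ones((3, 2, 2)), np.zeros((3, 2, 2)), mask)
```

In an instant-mask setting, a step like `attention_to_mask` would be recomputed at each denoising step rather than once off-line, which is the efficiency argument the abstract makes.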

Paper Structure (26 sections, 12 equations, 11 figures, 3 tables)

Introduction
Related Work
Text-to-Image Diffusion
Semantic Image Editing
Preliminary
Latent Diffusion Models
Cross-Attention in LDMs
Methodology
Overview
Instant Attention Mask Generation
Semantic Editing via Mask
Experiments
Experiment Setting
Datasets
Metrics
...and 11 more sections

Figures (11)

Figure 1: Illustration of existing diffusion-based image editing methods, where a manually or off-line generated mask is often used to control the editing area.
Figure 2: Visualization of the attention maps in Stable Diffusion. The target word "cat" has the best attention map, but the target word must be identified manually in practice. The attention map of the start token is relevant but still very noisy.
Figure 3: The framework of Instant Diffusion Editing (InstDiffEdit). InstDiffEdit generates a mask at each denoising step from the attention maps, and this mask provides instant guidance for image denoising. Part (a) illustrates the noising process, (b) depicts the generation of the semantic mask at each step, and (c) shows the diffusion-based image editing performed with that mask. Lastly, the inpainting model is applied to complete the generation (d).
Figure 4: The proposed instant mask generation. An indexing process is first performed based on the semantic similarities between the start token and the other tokens (upper left). Refinement is then performed between the indexed token and the remaining ones (lower left). Finally, the mask is obtained via adaptive aggregation of all attention maps.
Figure 5: The trade-offs of existing methods between different metrics. We conduct experiments by using two different metrics as the independent and dependent variables respectively. The proposed InstDiffEdit has the best trade-offs.
...and 6 more figures
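
The three stages named in the Figure 4 caption (indexing, refinement, aggregation) can be sketched as follows. This is one plausible reading of the caption only; the paper's actual equations are not part of this excerpt. The use of cosine similarity, the clipping of negative similarities, the weighted average, and the names `cosine` and `aggregate_attention` are all assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flattened attention maps."""
    a, b = np.ravel(a), np.ravel(b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def aggregate_attention(maps):
    """Hypothetical indexing + refinement + aggregation.

    maps: array of shape (T, H, W); maps[0] is assumed to be the
    attention map of the start token, maps[1:] those of the other tokens.
    Returns (index of the selected token in `maps`, aggregated map).
    """
    maps = np.asarray(maps, dtype=np.float64)
    start, rest = maps[0], maps[1:]

    # Indexing: pick the token whose map is most similar to the start token's.
    index = int(np.argmax([cosine(start, m) for m in rest]))

    # Refinement: weight every token map by its similarity to the indexed map
    # (negative similarities are clipped to zero; an assumption).
    weights = np.array([max(cosine(rest[index], m), 0.0) for m in rest])
    total = weights.sum()
    if total == 0:
        return index + 1, np.zeros_like(start)

    # Aggregation: similarity-weighted average of the token maps.
    aggregated = np.tensordot(weights / total, rest, axes=1)
    return index + 1, aggregated

# Toy example: token 2 matches the start token, token 1 does not.
toy = [
    [[0.0, 1.0], [0.0, 1.0]],  # start token
    [[1.0, 0.0], [1.0, 0.0]],  # token 1
    [[0.0, 2.0], [0.0, 2.0]],  # token 2
]
idx, agg = aggregate_attention(toy)
```

The aggregated map would still need to be thresholded to obtain a binary mask; how the threshold is chosen adaptively is not specified in this excerpt.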
