Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion

Aoxue Li, Mingyang Yi, Zhenguo Li

TL;DR

The proposed "MaSaFusion" significantly improves on existing T2I editing techniques by incorporating human annotation as external knowledge to confine editing within a "Mask-informed" region.

Abstract

Recently, text-to-image (T2I) editing has been greatly advanced by diffusion models. Despite the visual promise of the generated images, inconsistencies with the expected textual prompt remain prevalent. This paper aims to systematically improve text-guided image editing techniques based on diffusion models by addressing their limitations. Notably, the common approach in diffusion-based editing first reconstructs the source image via inversion techniques, e.g., DDIM Inversion. It then applies a fusion process that carefully integrates the source intermediate (hidden) states (obtained by inversion) with those of the target image. Unfortunately, such a standard pipeline fails in many cases due to the interference between texture retention and the creation of new characters in some regions. To mitigate this, we incorporate human annotation as external knowledge to confine editing within a "Mask-informed" region. Then we carefully Fuse the edited image with the source image and a constructed intermediate image within the model's Self-Attention module. Extensive empirical results demonstrate that the proposed "MaSaFusion" significantly improves on existing T2I editing techniques.
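
The inversion step mentioned in the abstract can be illustrated with the standard deterministic DDIM update run in the noising direction. The following is a minimal sketch under stated assumptions, not the authors' implementation: `eps_model` is a hypothetical stand-in for the diffusion network's noise prediction, `alpha_bars` is an assumed cumulative noise schedule ordered from least to most noisy, and the function names are illustrative only. The returned list plays the role of the source intermediate states that a fusion process would later consume.

```python
import numpy as np


def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM step from noise level t to the noisier level t+1.

    `eps` is the noise predicted at (x_t, t); it is passed in directly so the
    sketch does not depend on any particular network.
    """
    # Clean image implied by x_t and the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)
    # Re-noise that prediction to the next level, reusing the same eps.
    return np.sqrt(alpha_bar_next) * x0_pred + np.sqrt(1.0 - alpha_bar_next) * eps


def ddim_invert(x0, eps_model, alpha_bars):
    """Invert x0 along the schedule and keep every intermediate state.

    `eps_model(x, t)` is a hypothetical noise predictor (assumption).
    `alpha_bars[0]` is the least noisy level, `alpha_bars[-1]` the noisiest.
    """
    x = x0
    states = [x0]
    for t in range(len(alpha_bars) - 1):
        eps = eps_model(x, t)
        x = ddim_inversion_step(x, eps, alpha_bars[t], alpha_bars[t + 1])
        states.append(x)
    return states
```

Because each step reuses the same predicted noise for both the clean-image estimate and the re-noising, running the ordinary DDIM denoising update with identical noise predictions retraces the stored states, which is what makes reconstruction of the source image possible.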

Paper Structure (29 sections, 14 equations, 15 figures, 1 table, 5 algorithms)


Figures (15)

  • Figure 1: Our training-free method for text-to-image editing of real images. The method integrates human annotations, e.g., sketches (as in the text-to-image adapter mou2023t2i) and a restricted editing region.
  • Figure 2: Comparison of the existing (left, e.g. hertz2022prompt) and our (right) fusion processes for generating the desired target image. The standard pipeline inverts the source image and then fuses it with the target one. For the existing method on the left, the target image can be quite different from the source one, so the fusion process between them is difficult in practice. We propose to first generate an intermediate image with the desired shape under external conditions (e.g. T2I Adapter mou2023t2i), then obtain the target image by fusing it with the source and intermediate images, depending on each pixel's position relative to a given editing region.
  • Figure 3: Typical failure cases of existing methods. The failures are mainly twofold: reconstructing the source image (PnP, P2P) or generating .
  • Figure 4: The generated intermediate image with different initial noise $\boldsymbol{x}_{T}^{\rm t2i}$, where $\boldsymbol{x}_{T}^{\mathcal{S}}$ is obtained by inversion, $\boldsymbol{x}$ is Gaussian noise, and $\mathbf{1}_{A}$ is the indicator function of the set $A$. The external condition (target image sketch) and the shape of $A$ are also presented.
  • Figure 5: The edited images of existing methods and our MaSaFusion (NTI based). Here we present the results of single-turn editing; more results of multi-turn editing are in the appendix "More Results on SVE Task".
  • ...and 10 more figures
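
The Figure 4 caption names three ingredients for the initial noise: the inverted noise $\boldsymbol{x}_{T}^{\mathcal{S}}$, fresh Gaussian noise $\boldsymbol{x}$, and the indicator $\mathbf{1}_{A}$. The caption does not state how they are combined, so the sketch below shows only one plausible composition, labeled as an assumption: fresh noise inside $A$ and inverted noise outside it, with a flag for the opposite choice. The function name and arguments are hypothetical.

```python
import numpy as np


def compose_initial_noise(x_T_source, mask, rng, keep_source_inside=False):
    """Mix inverted and fresh Gaussian noise with an indicator mask (assumed form).

    x_T_source: inverted noise, e.g. shape (C, H, W).
    mask:       boolean array of shape (H, W), True inside the region A.
    rng:        a numpy.random.Generator, so the fresh noise is reproducible.
    """
    fresh = rng.standard_normal(x_T_source.shape)
    # Broadcast the (H, W) indicator over the channel dimension.
    inside = np.broadcast_to(mask, x_T_source.shape)
    if keep_source_inside:
        # Inverted noise inside A, fresh noise outside.
        return np.where(inside, x_T_source, fresh)
    # Fresh noise inside A, inverted noise outside (assumed default).
    return np.where(inside, fresh, x_T_source)
```

For example, with a `(4, 64, 64)` latent and a rectangular mask, `compose_initial_noise(x_T, mask, np.random.default_rng(0))` leaves every value outside the rectangle equal to the inverted noise and resamples the rest.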

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3