
E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

Tianrui Huang, Pu Cao, Lu Yang, Chun Liu, Mengjie Hu, Zhiwei Liu, Qing Song

TL;DR

Comprehensive quantitative and qualitative experiments demonstrate that the proposed zero-shot image editing method effectively resolves the text alignment issues prevalent in existing methods while maintaining the fidelity to the source image, and performs well across a wide range of editing tasks.

Abstract

Diffusion-based image editing is a composite process of preserving the source image content and generating new content or applying modifications. While current editing approaches have made improvements under text guidance, most of them have focused only on preserving the information of the input image, disregarding the importance of editability and alignment to the target prompt. In this paper, we prioritize editability by proposing a zero-shot image editing method, named Enhance Editability for text-based image Editing via Efficient CLIP guidance (E4C), which requires only inference-stage optimization to explicitly enhance editability and text alignment. Specifically, we develop a unified dual-branch feature-sharing pipeline that preserves either the structure or the texture of the source image while allowing the other to be adapted based on the editing task. We further integrate CLIP guidance into our pipeline by utilizing our novel random-gateway optimization mechanism to efficiently enhance the semantic alignment with the target prompt. Comprehensive quantitative and qualitative experiments demonstrate that our method effectively resolves the text alignment issues prevalent in existing methods while maintaining fidelity to the source image, and performs well across a wide range of editing tasks.
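The feature sharing mentioned in the abstract can be illustrated with a small toy. The sketch below covers only the case named in the Figure 2 caption, where the editing branch reuses the source branch's key-value pairs while keeping its own queries. It is a minimal pure-Python illustration, not the paper's implementation: the single-head attention, the absence of learned projections, the function names (`attention`, `shared_kv_attention`), and the toy vectors are all assumptions made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on plain Python lists."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Each output is a convex combination of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def shared_kv_attention(target_queries, source_keys, source_values):
    """Hypothetical helper: the editing (target) branch keeps its own
    queries but borrows keys and values from the source branch."""
    return attention(target_queries, source_keys, source_values)


# Toy features (assumed values, for illustration only).
source_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
source_values = [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]]
target_queries = [[2.0, 0.0], [0.0, 2.0]]

edited, attn = shared_kv_attention(target_queries, source_keys, source_values)
```

Because every output row is a weighted average of the source values, the edited branch can only recombine source appearance features; what changes is where the queries direct attention. This is one way to read the caption's claim that shared key-value pairs maintain appearance and surroundings.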

Paper Structure

This paper contains 32 sections, 11 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: E4C performs edits on various tasks. Given a real image and a target text prompt, our method can generate a new image that closely matches the description without relying on masks or segmentation maps. Our method outperforms task-specific methods even in their advantageous domains.
  • Figure 2: Dual-branch pipeline with CLIP guidance inserted. Here we attempt to change the shape of a cake from round to square. Thanks to the shared key-value pairs, we maintain its appearance and surroundings. We further use the CLIP loss to refine the Q features, which ensures that our final results stay consistent with $P_t$.
  • Figure 3: High-level overview of our framework and comparison to previous methods. From top to bottom are our three branches: inversion, reconstruction (source), and editing (target). In the top two branches, we take a different latent-alignment strategy ($z_t$ to $z_t^*$) compared to NTI [mokady2023null] and PTI [dong2023prompt]. In addition, we apply adaptive feature sharing (a.f.s) between the source and target branches. Note that all operations denoted by a bidirectional curved arrow $\leftrightarrow$ are applied at each timestep.
  • Figure 4: Visualizing queries before and after CLIP guidance. We use PCA to visualize "making a teddy bear raise left hand" queries before and after fine-tuning with CLIP guidance, extracted from the 15th self-attention layer (16 in total).
  • Figure 5: Illustration of our random-gateway optimization mechanism. Top: the computation graph of a single sampling step. At a gateway step, we cache all gradients contributed to the U-Net parameters from both the noise term $\epsilon_\theta$ and the residual latent term $z_t$. At all other steps, the gradient through the noise term is stopped, leaving only the latent path to pass the gradient signal to the previous step. Bottom: overview of the gradient flow in the sampling process. The U-Net parameters are updated only at the gateways, but the gradient flows through the entire sampling process.
  • ...and 9 more figures
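
The gradient routing described in the Figure 5 caption can be sketched with a scalar toy. This is not the paper's algorithm: the linear "noise predictor" `theta * z`, the update `z = a * z + b * eps`, the quadratic loss, the hand-written backward pass, and the names `sample_forward`, `gateway_gradient`, and `pick_gateways` are all assumptions made for illustration. In particular, choosing gateways uniformly at random, and letting a gateway step pass the gradient through both the noise path and the latent path, are one possible reading of the caption.

```python
import random


def sample_forward(z_T, theta, a, b, num_steps):
    """Run the toy sampler and return every latent, starting with z_T."""
    latents = [z_T]
    z = z_T
    for _ in range(num_steps):
        eps = theta * z      # stand-in for the noise prediction eps_theta(z_t)
        z = a * z + b * eps  # latent path plus noise path
        latents.append(z)
    return latents


def gateway_gradient(latents, theta, a, b, target, gateways):
    """Manual backward pass for L = 0.5 * (z_0 - target)**2.

    Gateway steps: the noise term is differentiated, so it contributes
    to the parameter gradient and to the signal passed backwards.
    Other steps: the noise term is treated as a constant (stop-gradient),
    so only the latent path (factor a) carries the signal.
    """
    num_steps = len(latents) - 1
    g = latents[-1] - target  # dL/dz_0
    grad_theta = 0.0
    for step in reversed(range(num_steps)):
        z_in = latents[step]
        if step in gateways:
            grad_theta += g * b * z_in
            g *= a + b * theta
        else:
            g *= a
    return grad_theta


def pick_gateways(num_steps, num_gateways, seed):
    """Assumed selection rule: sample gateway steps uniformly at random."""
    return set(random.Random(seed).sample(range(num_steps), num_gateways))


z_T, theta, a, b, num_steps, target = 1.0, 0.5, 0.9, -0.1, 5, 0.0
latents = sample_forward(z_T, theta, a, b, num_steps)
gateways = pick_gateways(num_steps, 2, seed=0)
partial_grad = gateway_gradient(latents, theta, a, b, target, gateways)
full_grad = gateway_gradient(latents, theta, a, b, target, set(range(num_steps)))
no_grad = gateway_gradient(latents, theta, a, b, target, set())
```

With every step marked as a gateway, the routine reduces to ordinary backpropagation through the whole trajectory; with fewer gateways, the parameter gradient is accumulated at only a few steps while the signal still travels along the latent path through all of them.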