TextureDiffusion: Target Prompt Disentangled Editing for Various Texture Transfer

Zihan Su, Junhao Zhuang, Chun Yuan

TL;DR

TextureDiffusion targets a limitation of existing text-guided editors, which struggle to transfer complex textures. It disentangles texture from image content by setting the target prompt to '<texture>' alone. A structure-preservation module injects self-attention query features and residual-block features from the input image, and an edit localization technique uses a mask derived from cross-attention maps to confine edits to the target region. The method is tuning-free and operates in the Stable Diffusion latent space, transferring textures harmoniously while preserving structure and background. Experiments on PIE-Bench show superior performance against multiple baselines, and the authors provide public code for reproducibility at https://github.com/THU-CVML/TextureDiffusion.

Abstract

Recently, text-guided image editing has achieved significant success. However, existing methods can only apply simple textures like wood or gold when changing the texture of an object. Complex textures such as cloud or fire pose a challenge. This limitation arises because the target prompt must contain both the input image content and <texture>, which restricts the texture representation. In this paper, we propose TextureDiffusion, a tuning-free image editing method for transferring various textures. First, the target prompt is set directly to "<texture>", which disentangles the texture from the input image content and enhances texture representation. Next, query features in self-attention and features in residual blocks are used to preserve the structure of the input image. Finally, to maintain the background, we introduce an edit localization technique that blends the self-attention results and the intermediate latents. Comprehensive experiments demonstrate that TextureDiffusion can harmoniously transfer various textures with excellent structure and background preservation. Code is publicly available at https://github.com/THU-CVML/TextureDiffusion
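
The edit localization step can be illustrated with a minimal sketch. It assumes the mask is obtained by min-max normalizing a cross-attention map and thresholding it, and that blending is a per-element convex combination of the edited and source latents. The function names `binarize` and `blend`, the threshold of 0.5, and the use of nested lists in place of tensors are all illustrative assumptions, not details taken from the paper or its released code.

```python
def binarize(attn_map, threshold=0.5):
    """Min-max normalize a 2D attention map, then threshold it to a {0.0, 1.0} mask.

    The threshold value is a hypothetical choice for illustration.
    """
    flat = [v for row in attn_map for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[1.0 if (v - lo) / scale >= threshold else 0.0 for v in row]
            for row in attn_map]


def blend(mask, edited, source):
    """Keep edited values where mask is 1 and source values where mask is 0."""
    return [[m * e + (1.0 - m) * s for m, e, s in zip(m_row, e_row, s_row)]
            for m_row, e_row, s_row in zip(mask, edited, source)]


# Toy 2x2 example: the bottom row is treated as the target object region.
attn = [[0.1, 0.2], [0.8, 0.9]]
source_latent = [[1.0, 2.0], [3.0, 4.0]]
edited_latent = [[5.0, 5.0], [5.0, 5.0]]

mask = binarize(attn)
blended = blend(mask, edited_latent, source_latent)
```

In this toy case the top row keeps the source values and the bottom row takes the edited values, which is the background-preserving behavior the abstract describes. A real pipeline would apply the same idea to latent tensors at each denoising step.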

Paper Structure

This paper contains 9 sections, 3 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Existing text-guided image editing methods cannot transfer complex textures. By disentangling the texture from the description of the input image in the target prompt and applying the proposed structure preservation module and edit localization technique, TextureDiffusion can harmoniously transfer various textures to the target object.
  • Figure 2: Pipeline of the proposed TextureDiffusion. (a) Our method inverts the input image into an initial latent $Z_{T}^{*}$ and denoises it using DDIM sampling. In the denoising process, we directly set the target prompt to "<texture>". (b) For structure preservation, query features in self-attention and features in residual blocks are injected during the generation of the edited image. For edit localization, we utilize the self-attention results and a mask obtained from the cross-attention map.
  • Figure 3: Results of qualitative comparisons. The blue word represents the texture. For our method, the target prompt is "<texture>" only. For the other methods, the target prompt is a complete sentence. Best viewed zoomed in.
  • Figure 4: Results of ablation study.
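
The query injection described in the Figure 2(b) caption can be sketched as follows. This is one plausible reading, assuming the self-attention queries of the editing branch are replaced by the queries computed for the input image, while keys and values stay with the editing branch. The function names and the pure-Python list representation are hypothetical and chosen only to keep the example self-contained.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of row vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


def self_attention_with_query_injection(q_source, k_edit, v_edit):
    """Attend with source-image queries over the editing branch's keys and values.

    Assumption: the queries carry the layout of the input image, so reusing
    them steers the edited features toward the original structure.
    """
    return attention(q_source, k_edit, v_edit)


# Toy example: a zero query attends uniformly, so the output is the mean of the values.
q_src = [[0.0, 0.0]]
k_edit = [[1.0, 0.0], [0.0, 1.0]]
v_edit = [[2.0, 0.0], [0.0, 4.0]]
out = self_attention_with_query_injection(q_src, k_edit, v_edit)
```

Only the queries cross from the source branch in this sketch, so texture information from the "<texture>" prompt can still enter through the editing branch's keys and values.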