Harmonizing Attention: Training-free Texture-aware Geometry Transfer

Eito Ikuta, Yohan Lee, Akihiro Iohara, Yu Saito, Toshiyuki Tanaka

TL;DR

Harmonizing Attention tackles geometry transfer across materials without model training by modifying diffusion-model self-attention so that it attends to multiple reference images. Texture-aligning Attention during inversion and Geometry-preserving Attention during generation decouple geometry from material texture while maintaining texture continuity, all without fine-tuning. The approach demonstrates superior geometry fidelity and perceptual harmony in qualitative and quantitative evaluations against strong baselines, as well as in lightweight user studies. Because it is training-free, the framework broadens practical image compositing applications such as augmented reality and advanced image editing, providing geometry-first harmonization with minimal dataset requirements.

Abstract

Extracting geometry features from photographic images independently of surface texture and transferring them onto different materials remains a complex challenge. In this study, we introduce Harmonizing Attention, a novel training-free approach that leverages diffusion models for texture-aware geometry transfer. Our method employs a simple yet effective modification of self-attention layers, allowing the model to query information from multiple reference images within these layers. This mechanism is seamlessly integrated into the inversion process as Texture-aligning Attention and into the generation process as Geometry-preserving Attention. This dual-attention approach ensures the effective capture and transfer of material-independent geometry features while maintaining material-specific textural continuity, all without the need for model fine-tuning.
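
To make the idea of querying multiple reference images concrete, here is a minimal sketch in plain Python. It assumes the references are injected by concatenating their keys and values with the image's own, which is how the Figure 5 caption describes Texture-aligning Attention. The function names, the list-of-rows token layout, and the omission of heads and learned projections are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return out


def multi_reference_attention(Q, K_self, V_self, K_ref, V_ref):
    """Hypothetical helper: queries attend to the reference tokens and the
    image's own tokens at once, by concatenating keys and values."""
    return attention(Q, K_ref + K_self, V_ref + V_self)


# Toy example: one zero query, so every token gets the same weight.
Q = [[0.0, 0.0]]
K_self = [[1.0, 0.0], [0.0, 1.0]]
V_self = [[0.0], [0.0]]
K_ref = [[1.0, 1.0], [2.0, 0.0]]
V_ref = [[1.0], [1.0]]
mixed = multi_reference_attention(Q, K_self, V_self, K_ref, V_ref)
```

Because the softmax runs over the concatenated token set, each output is a convex combination of reference and self values, so the reference can influence the result without fully replacing the image's own features.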

Paper Structure (22 sections, 7 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Examples of texture-aware geometry transfer. Our method seamlessly transfers geometric features (holes, cracks, droplets, etc.) from source images (red rectangles in the insets of the top row) to target images with different surface textures (main images in the top row).
  • Figure 2: Overview of the Harmonizing Attention framework. The process consists of four main stages. (1) Editing: The masked region of the source image $\boldsymbol{M}^\mathrm{src} \odot \boldsymbol{I}^\mathrm{src}$ is transformed, positioned, and color-adjusted to create a geometry image $\boldsymbol{I}^\mathrm{geo}$ and a corresponding mask $\boldsymbol{M}^\mathrm{geo}$. (2) Inversion: The source, target, and geometry images are first compressed by the VAE, yielding clean latents $\boldsymbol{z}^\mathrm{src}_0$, $\boldsymbol{z}^\mathrm{tar}_0$, and $\boldsymbol{z}^\mathrm{geo}_0$, respectively. These latents, along with the source and geometry masks resized to the latent resolution, are then fed into an SD inpainting model, which produces the inverted noisy latents $\boldsymbol{z}^\mathrm{src}_T$, $\boldsymbol{z}^\mathrm{tar}_T$, and $\boldsymbol{z}^\mathrm{geo}_T$. In this stage, the self-attention computation for the geometry image is replaced with Texture-aligning Attention, which incorporates information from the target image to align the geometry image with the target domain. (3) Blending: The latents of the target and geometry images are combined using the geometry mask. (4) Generation: The blended latents are denoised with a modified self-attention, named Geometry-preserving Attention, that references the source image, preserving the geometry while ensuring seamless integration. For clarity, all latents are shown in pixel space rather than in the VAE latent space.
  • Figure 3: Qualitative comparison of image generation results for geometry transfer. The compared methods are configured to maximize the quality of their generated images, and their highest-quality results are shown. Each column shows the output of a compared harmonization method alongside the output of our method.
  • Figure 4: Qualitative comparison of images generated with different types of color adjustment, shown for two example image pairs. The first column shows the background and foreground images that are input to our method. For each pair, the top row shows the pasted image, in which the foreground is pasted onto the background, and the bottom row shows the output of our method for each pasted image. Results are shown for (i) the original-color image, (ii) and (iii) color-shifted images of the target image obtained with the color shift parameter, and (iv) the color-matched image obtained with histogram matching.
  • Figure 5: Qualitative comparison of images generated with different types of attention during the inversion phase. The first column shows pasted images, that is, geometry images overlaid on target images, as shown in Figure 2. The second column shows results when $\boldsymbol{K}^\mathrm{geo}$ and $\boldsymbol{V}^\mathrm{geo}$ are replaced with $\boldsymbol{K}^\mathrm{tar}$ and $\boldsymbol{V}^\mathrm{tar}$. The third column shows results with the ordinary self-attention computation. The fourth column shows results with Texture-aligning Attention, where target- and geometry-derived keys and values are concatenated as in the inversion-attention equation.
  • ...and 1 more figure
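
The blending stage in the Figure 2 caption can be illustrated with a short sketch. The caption says only that the target and geometry latents are "combined using the geometry mask"; the code below assumes the common per-element form $\boldsymbol{M}^\mathrm{geo} \odot \boldsymbol{z}^\mathrm{geo}_T + (1 - \boldsymbol{M}^\mathrm{geo}) \odot \boldsymbol{z}^\mathrm{tar}_T$ with a binary mask. The function name `blend_latents` and the single-channel 2D layout are hypothetical simplifications.

```python
def blend_latents(z_tar, z_geo, mask):
    """Assumed blending rule: take z_geo where mask is 1 and z_tar where
    mask is 0. All inputs are 2D lists of equal shape (one channel)."""
    return [
        [m * g + (1 - m) * t for t, g, m in zip(row_t, row_g, row_m)]
        for row_t, row_g, row_m in zip(z_tar, z_geo, mask)
    ]


# Toy 2x2 latents: the mask selects only the top-right element.
z_tar = [[1.0, 1.0], [1.0, 1.0]]
z_geo = [[5.0, 5.0], [5.0, 5.0]]
mask = [[0, 1], [0, 0]]
blended = blend_latents(z_tar, z_geo, mask)
```

In a real latent tensor the same mask would be broadcast across channels, and a soft (non-binary) mask would interpolate between the two latents near the boundary.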