Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models

Sooyeon Go, Kyungmook Choi, Minjung Shin, Youngjung Uh

TL;DR

This work tackles training-free appearance transfer in diffusion models by leveraging dense semantic correspondences between target and reference images. A semantic matching-based feature rearrangement is performed before self-attention, with the rearranged reference features injected into the target denoising process, and per-timestep matching is used to obtain accurate, object-level correspondences. The approach achieves superior appearance transfer while preserving target structure, demonstrated through strong performance on appearance similarity, structure preservation, and dense correspondence metrics, and extends to cross-domain and multi-object scenarios. The method relies on inversion for real-image references and may be limited when semantic correspondences are absent, but it offers a practical, scalable solution for training-free, semantically aware image editing with diffusion models.

Abstract

As pre-trained text-to-image diffusion models have become a useful tool for image synthesis, people want to specify the results in various ways. This paper tackles training-free appearance transfer, which produces an image that combines the structure of a target image with the appearance of a reference image. Existing methods usually do not reflect semantic correspondence, as they rely on query-key similarity within the self-attention layer to establish correspondences between images. To address this, we propose explicitly rearranging the features according to the dense semantic correspondences. Extensive experiments show the superiority of our method in various aspects: preserving the structure of the target and reflecting the correct color from the reference, even when the two images are not aligned.

Paper Structure (43 sections, 12 equations, 23 figures, 9 tables)

Figures (23)

  • Figure 1: Our method transfers semantically corresponding appearances from reference images to target images. In contrast to other methods such as DiffEditor diffeditor and Cross-Image cross_image, our method preserves the structure of the target images and successfully transfers the colors and patterns from the references according to their semantic meanings.
  • Figure 2: Pipeline of our method. We transfer the semantically corresponding appearance of objects from a reference image to a target image. Given $I^\text{ref}$, $I^\text{target}$, and their masks $M^\text{ref}$ and $M^\text{target}$, we find semantic correspondences between the features $F^\text{ref}_t$ and $F^\text{output}_t$, taken before the self-attention layers. Then, we inject the rearranged features based on these correspondences.
  • Figure 3: Query-key attention maps vs. our feature matching. For each query pixel $\mathbf{q}$ denoted by colored markers in the target image, we show the attention maps based on the $QK$ attention score. (b) and (d) include other regions in the attention map where matching is incorrect. In contrast, the feature matching in (c) and (e) presents a single point with the correct semantic meaning.
  • Figure 4: Feature rearrangement and injection. The reference feature, rearranged based on similarity to the output feature, is injected into the output denoising process.
  • Figure 5: Comparison between conventional matching methods and ours. (a) Conventional methods aggregate features from multiple time steps of the reference and target into a single set and perform matching only once. (b) Ours matches the reference features with the output features and performs multiple matches across individual steps.
  • ...and 18 more figures
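
The rearrangement described in the captions of Figures 2 and 4 can be illustrated with a small sketch. This is not the authors' implementation: it assumes cosine similarity with an argmax as the matching rule, treats each feature map as a flat list of per-position vectors, and uses the masks only to restrict matching to object positions. The function names `cosine` and `rearrange_reference` are hypothetical. In the setting of Figure 5(b), a routine of this kind would be called once per denoising step on $F^\text{ref}_t$ and $F^\text{output}_t$.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def rearrange_reference(f_out, f_ref, m_out, m_ref):
    """For each masked output position, pick the most similar masked reference feature.

    f_out, f_ref: lists of feature vectors, one per spatial position.
    m_out, m_ref: lists of booleans marking object positions.
    Returns the features to inject and the matched reference index per
    position (None where the output position lies outside its mask).
    Assumption: cosine similarity + argmax stands in for the matching rule.
    """
    ref_ids = [j for j, keep in enumerate(m_ref) if keep]
    injected, matches = [], []
    for i, feat in enumerate(f_out):
        if not m_out[i] or not ref_ids:
            injected.append(list(feat))  # outside the mask: keep the output feature
            matches.append(None)
            continue
        best = max(ref_ids, key=lambda j: cosine(feat, f_ref[j]))
        injected.append(list(f_ref[best]))  # copy the matched reference feature
        matches.append(best)
    return injected, matches

# Toy example: 3 output positions, 3 reference positions, 2-D features.
f_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_ref = [[0.0, 2.0], [3.0, 0.1], [-1.0, -1.0]]
m_out = [True, True, False]
m_ref = [True, True, True]

injected, matches = rearrange_reference(f_out, f_ref, m_out, m_ref)
print(matches)   # [1, 0, None]
print(injected)  # [[3.0, 0.1], [0.0, 2.0], [1.0, 1.0]]
```

Each masked output position receives exactly one reference feature, which mirrors the single-point matches shown in Figure 3 (c) and (e), whereas a softmax over query-key scores would blend several reference positions.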