Eye-for-an-eye: Appearance Transfer with Semantic Correspondence in Diffusion Models
Sooyeon Go, Kyungmook Choi, Minjung Shin, Youngjung Uh
TL;DR
This work tackles training-free appearance transfer in diffusion models by leveraging dense semantic correspondences between target and reference images. Reference features are rearranged by semantic matching before self-attention and then injected into the target denoising process; the matching is performed at each timestep to obtain accurate, object-level correspondences. The approach transfers appearance while preserving target structure, as shown by strong performance on appearance similarity, structure preservation, and dense correspondence metrics, and it extends to cross-domain and multi-object scenarios. The method relies on inversion for real-image references and may be limited when semantic correspondences are absent, but it offers a practical, training-free route to semantically aware image editing with diffusion models.
Abstract
As pre-trained text-to-image diffusion models have become a useful tool for image synthesis, users want to control the results in various ways. This paper tackles training-free appearance transfer, which produces an image that has the structure of a target image and the appearance of a reference image. Existing methods usually do not reflect semantic correspondence, as they rely on query-key similarity within the self-attention layer to establish correspondences between images. To address this, we propose explicitly rearranging the features according to the dense semantic correspondences. Extensive experiments show the superiority of our method in various aspects: preserving the structure of the target and reflecting the correct color from the reference, even when the two images are not aligned.
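The rearrangement described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes that correspondences are found by taking, for each target location, the reference location with the highest cosine similarity, and it uses plain Python lists in place of diffusion features. The names `cosine`, `rearrange_by_correspondence`, `target_feats`, `ref_feats`, and `ref_values` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # epsilon avoids division by zero

def rearrange_by_correspondence(target_feats, ref_feats, ref_values):
    """For each target location, find the best-matching reference location
    (argmax of cosine similarity, an assumption of this sketch) and gather
    the reference value stored there."""
    matches = []
    for t in target_feats:
        sims = [cosine(t, r) for r in ref_feats]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    # Reference values reordered to follow the target's spatial layout.
    rearranged = [ref_values[j] for j in matches]
    return rearranged, matches

# Toy data: three locations per image, 2-D matching features.
target_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_feats = [[0.0, 2.0], [3.0, 3.0], [5.0, 0.0]]
ref_values = ["ref_loc_0", "ref_loc_1", "ref_loc_2"]  # stand-ins for appearance features

rearranged, matches = rearrange_by_correspondence(target_feats, ref_feats, ref_values)
print(matches)     # [2, 0, 1]
print(rearranged)  # ['ref_loc_2', 'ref_loc_0', 'ref_loc_1']
```

In this sketch the rearranged list is aligned with the target's locations, so injecting it into the target's denoising process would place each reference feature at its semantically matched position rather than relying on query-key similarity inside self-attention.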
