Photoswap: Personalized Subject Swapping in Images

Jing Gu, Yilin Wang, Nanxuan Zhao, Tsu-Jui Fu, Wei Xiong, Qing Liu, Zhifei Zhang, He Zhang, Jianming Zhang, HyunJoon Jung, Xin Eric Wang

TL;DR

It is established that a well-conceptualized visual subject can be seamlessly transferred to any image with appropriate self-attention and cross-attention manipulation, maintaining the pose of the swapped subject and the overall coherence of the image.

Abstract

In an era where images and visual content dominate our digital landscape, the ability to manipulate and personalize these images has become a necessity. Envision seamlessly substituting a tabby cat lounging on a sunlit window sill in a photograph with your own playful puppy, all while preserving the original charm and composition of the image. We present Photoswap, a novel approach that enables this immersive image editing experience through personalized subject swapping in existing images. Photoswap first learns the visual concept of the subject from reference images and then swaps it into the target image using pre-trained diffusion models in a training-free manner. We establish that a well-conceptualized visual subject can be seamlessly transferred to any image with appropriate self-attention and cross-attention manipulation, maintaining the pose of the swapped subject and the overall coherence of the image. Comprehensive experiments underscore the efficacy and controllability of Photoswap in personalized subject swapping. Furthermore, Photoswap significantly outperforms baseline methods in human ratings across subject swapping, background preservation, and overall quality, revealing its vast application potential, from entertainment to professional editing.

Paper Structure (4 sections, 4 figures)

Figures (4)

  • Figure 1: Results of Textual Inversion (Gal et al., 2023) as the concept learning module. It successfully captures key subject features, but its performance drops when representing complex structures such as human faces.
  • Figure 2: Results at different swapping steps. For the same number of steps, swapping the self-attention output provides superior control over the layout, including the subject's gestures and the background details. However, excessive swapping can affect the subject's identity, because the new concept introduced through the text prompt may be overshadowed by the swapped attention output or attention map. This effect is clearer when swapping the self-attention output $\lambda_\phi$. Furthermore, we observed that replacing the attention map for a large number of steps can produce an image with significant noise, possibly due to an incompatibility between the attention map and the value vector $v$.
  • Figure 3: Results on real human face images across different races. Evidently, the skin colors are also successfully transferred when swapping a white person with a black person, and vice versa.
  • Figure 4: Failure cases. The model sometimes struggles to accurately reconstruct hand details and complex background information, such as formulas on a whiteboard.
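
The difference between swapping an attention map and swapping an attention output, as discussed in the Figure 2 caption, can be sketched on a toy single-head attention layer. This is a minimal illustration and not the authors' implementation: the function names, the thresholds `map_steps` and `output_steps`, the rule that output swapping takes precedence over map swapping, and the toy inputs are all assumptions made for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k):
    """Row-wise softmax of scaled dot products between queries and keys."""
    scale = math.sqrt(len(q[0]))
    return [
        softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        for qi in q
    ]


def apply_map(attn, v):
    """Attention output: each row of `attn` weights the value vectors in `v`."""
    width = len(v[0])
    return [
        [sum(w * vj[c] for w, vj in zip(row, v)) for c in range(width)]
        for row in attn
    ]


def swapped_self_attention(step, target, source, map_steps, output_steps):
    """Toy self-attention for the target pass with source features swapped in.

    Hypothetical schedule (an assumption for this sketch):
      step < output_steps : reuse the source's whole attention output
      step < map_steps    : reuse only the source's attention map, which is
                            then applied to the target's own value vectors
      otherwise           : ordinary target attention, no swapping
    """
    source_map = attention_map(source["q"], source["k"])
    if step < output_steps:
        return apply_map(source_map, source["v"])
    if step < map_steps:
        return apply_map(source_map, target["v"])
    return apply_map(attention_map(target["q"], target["k"]), target["v"])


# Made-up features: two tokens with two channels each.
source = {
    "q": [[1.0, 0.0], [0.0, 1.0]],
    "k": [[1.0, 0.0], [0.0, 1.0]],
    "v": [[1.0, 2.0], [3.0, 4.0]],
}
target = {
    "q": [[0.5, 0.5], [1.0, -1.0]],
    "k": [[0.0, 1.0], [1.0, 1.0]],
    "v": [[5.0, 6.0], [7.0, 8.0]],
}

for step in (0, 5, 20):
    out = swapped_self_attention(step, target, source, map_steps=10, output_steps=2)
    print(step, out)
```

In the map-swapping branch the source's attention weights are combined with the target's value vectors, which is the pairing the caption suggests may become incompatible when it is kept for too many steps. In the output-swapping branch the target's values are bypassed entirely, which gives stronger layout control but leaves less room for the new concept.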