Zero-Shot Head Swapping in Real-World Scenarios

Taewoong Kang, Sohyun Jeong, Hyojin Jang, Jaegul Choo

TL;DR

This work addresses the challenge of zero-shot head swapping in real-world images that include full heads and upper bodies with diverse poses. The authors introduce HID, a diffusion-based framework that integrates an IOMask for automatic context-aware masking and a Hair Injection Module to preserve hairstyle details, enabling seamless head-body fusion. By leveraging DDIM inversion, identity-oriented embedding fusion (PhotoMaker V2), and ControlNet-driven pose conditioning, HID achieves state-of-the-art results on challenging data, outperforming baselines in hair fidelity, identity preservation, and image quality. The approach reduces the need for cropping or post-hoc compositing, improving practicality for real-world applications in media, avatars, and editing workflows.

Abstract

With growing demand in media and social networks for personalized images, the need for advanced head-swapping techniques, which integrate an entire head from the head image with the body from the body image, has increased. However, traditional head-swapping methods rely heavily on face-centered cropped data with primarily frontal-facing views, which limits their effectiveness in real-world applications. Additionally, their masking methods, designed to indicate regions requiring editing, are optimized for these types of datasets but struggle to achieve seamless blending in complex situations, such as when the original data includes features like long hair extending beyond the masked area. To overcome these limitations and enhance adaptability in diverse and complex scenarios, we propose a novel head-swapping method, HID, that is robust to images including the full head and the upper body, and handles views ranging from frontal to side, while automatically generating context-aware masks. For automatic mask generation, we introduce the IOMask, which enables seamless blending of the head and body, effectively addressing integration challenges. We further introduce the hair injection module to capture hair details with greater precision. Our experiments demonstrate that the proposed approach achieves state-of-the-art performance in head swapping, providing visually consistent and realistic results across a wide range of challenging conditions.
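
As a rough illustration of what a mask "indicating regions requiring editing" does, the sketch below composites generated content into a body image with a soft mask. It is a generic blending example, not HID's IOMask; the function name `blend_with_mask` and the array shapes are assumptions made for illustration.

```python
import numpy as np

def blend_with_mask(generated, body, mask):
    """Composite `generated` into `body` wherever `mask` is close to 1.

    generated, body: float arrays of shape (H, W, C).
    mask: array of shape (H, W) or (H, W, 1); 1 = edit, 0 = keep body.
    """
    mask = np.clip(np.asarray(mask, dtype=np.float32), 0.0, 1.0)
    if mask.ndim == generated.ndim - 1:
        mask = mask[..., None]  # add a channel axis so the mask broadcasts
    return mask * generated + (1.0 - mask) * body

# Toy usage: edit only the top half of a 4x4 image.
body = np.zeros((4, 4, 3), dtype=np.float32)
generated = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[:2, :] = 1.0
out = blend_with_mask(generated, body, mask)
```

If the mask is too small, for example when it misses long hair that extends beyond the head region, pixels outside it are copied unchanged from the body image. That is the failure mode the abstract describes for fixed masking schemes.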

Paper Structure

This paper contains 33 sections, 5 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Head-swapped images generated by our approach. Using the proposed method, HID, the head in the images of the Head column is seamlessly integrated onto the images of the Body column, resulting in realistic and cohesive head-swapped images in the Swapped column.
  • Figure 2: Face swapping simply applies the head image’s face ID to the body image. In contrast, head swapping requires applying not only the head image’s face ID but also the hairstyle, face shape, and skin tone.
  • Figure 3: Most previous head-swapping methods [deepfacelab, reface, faceX] rely on datasets of zoomed-in, face-centered images, as shown in (a). In this case, an additional step is needed to merge the head-swapped images back onto the original body, which can often result in mismatched or unnatural outcomes. In contrast, our approach uses a dataset that includes the whole upper body, as shown in (b), enabling head swapping in more realistic, real-world scenarios.
  • Figure 4: When existing methods that use face-centered cropped datasets [deepfacelab, reface, faceX] handle cases where the head-swapped image needs to be pasted back onto the original body image, they can generate inharmonious outcomes because the remaining parts of the body image cannot be addressed. The original body image (a) has brown hair, while the head-swapped image (c) has blonde hair, causing the final image (b) to show a noticeable color inconsistency in the hair.
  • Figure 5: Overview of HID. Our HID consists of two main stages (left and right). In the left stage (blue region), we obtain updated text embeddings by fusing embeddings. The fused ID embeddings and fused hair embeddings replace part of the original text embeddings, resulting in updated text embeddings. In the right stage (white region), the final output $I_o$, a head-swapped image, is generated. DDIM inversion is performed to reconstruct the image while leveraging our IOMask to infer which parts of the body image should be removed, thereby generating the head image. During this process, the updated text embeddings obtained in the left stage, along with the output from ControlNet, serve as conditioning inputs for the diffusion model.
  • ...and 6 more figures
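
The Figure 5 caption mentions DDIM inversion as the step used to reconstruct the image. The sketch below shows generic deterministic DDIM inversion and the matching sampling loop in NumPy. It is a minimal sketch, not HID's implementation: the function names, the `alphas` schedule, and the toy constant noise predictor are all assumptions made for illustration, and a real pipeline would call a trained diffusion model in place of `eps_model`.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative-alpha levels."""
    # Predict the clean sample implied by the current noise estimate.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise (or de-noise) that prediction to the target level.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def ddim_invert(x0, eps_model, alphas):
    """Map a clean sample to a noisy latent.

    alphas: cumulative alpha values ordered clean -> noisy, e.g. 1.0 ... 0.1.
    Returns the full trajectory [x_0, x_1, ..., x_T].
    """
    x, traj = x0, [x0]
    for i in range(len(alphas) - 1):
        eps = eps_model(x, i)  # noise estimate at the current level
        x = ddim_step(x, eps, alphas[i], alphas[i + 1])
        traj.append(x)
    return traj

def ddim_sample(x_T, eps_model, alphas):
    """Run the deterministic sampler from the noisy latent back to level 0."""
    x = x_T
    for i in range(len(alphas) - 1, 0, -1):
        eps = eps_model(x, i)
        x = ddim_step(x, eps, alphas[i], alphas[i - 1])
    return x

# Toy check with a hypothetical noise predictor that ignores its input.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))
fixed_eps = rng.normal(size=(4, 4))
alphas = np.linspace(1.0, 0.1, 6)
toy_model = lambda x, t: fixed_eps
trajectory = ddim_invert(x0, toy_model, alphas)
reconstruction = ddim_sample(trajectory[-1], toy_model, alphas)
```

With this constant predictor the round trip is exact up to floating-point error. With a learned model the inversion is only approximate, because the noise estimate at one level stands in for the estimate at the next.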