What to Preserve and What to Transfer: Faithful, Identity-Preserving Diffusion-based Hairstyle Transfer

Chaeyeon Chung, Sunghyun Park, Jeongho Kim, Jaegul Choo

TL;DR

HairFusion tackles the challenging problem of transferring a reference hairstyle to a face image while preserving identity, clothing, and background under real-world conditions. It introduces a diffusion-based, one-stage framework that treats hairstyle transfer as exemplar-based image inpainting, featuring a hair-agnostic input, the hair Align Cross-Attention (Align-CA) for pose-aware hair alignment, and adaptive hair blending to protect non-hair regions during inference. The method combines a hair encoder, a dense-pose–augmented cross-attention module, and CLIP-based conditioning within a latent diffusion model, trained on multi-view datasets. Experimental results show state-of-the-art performance in both hairstyle transfer quality and reconstruction accuracy, including robust performance on in-the-wild images with diverse poses and focal lengths, indicating strong practical applicability and generalization.

Abstract

Hairstyle transfer is a challenging image editing task that modifies the hairstyle of a given face image while preserving its other appearance and background features. Existing hairstyle transfer approaches rely heavily on StyleGAN, which is pre-trained on cropped and aligned face images. Hence, they struggle to generalize under challenging conditions such as extreme variations in head pose or focal length. To address this issue, we propose HairFusion, a one-stage hairstyle transfer diffusion model that applies to real-world scenarios. Specifically, we carefully design a hair-agnostic representation as the model input, in which the original hair information is thoroughly eliminated. Next, we introduce a hair align cross-attention (Align-CA) that accurately aligns the reference hairstyle with the face image while accounting for the difference in their head poses. To better preserve the face image's original features, we apply adaptive hair blending during inference: the output's hair regions are estimated from the cross-attention map in Align-CA and blended with the non-hair areas of the face image. Our experimental results show that our method achieves state-of-the-art performance compared to existing methods in preserving the integrity of both the transferred hairstyle and the surrounding features. The code is available at https://github.com/cychungg/HairFusion
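
To make the Align-CA idea more concrete, the following is a minimal NumPy sketch of a cross-attention step in which face features form the query, reference-hair features form the key and value, and pose features are added to the query and key as guidance (as the Figure 2 caption describes). It is an illustration under stated assumptions, not the authors' implementation: the function name `align_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the choice to add pose features after the projection, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def align_cross_attention(face_feat, hair_feat, face_pose, hair_pose,
                          w_q, w_k, w_v):
    """Pose-guided cross-attention (illustrative sketch only).

    face_feat: (n_face, d) tokens from the hair-agnostic face branch.
    hair_feat: (n_hair, d) tokens from the reference-hair branch.
    face_pose, hair_pose: pose tokens with the same shapes as above.
    Assumption: pose features are added after the linear projections.
    """
    q = face_feat @ w_q + face_pose   # query from the face, plus pose guidance
    k = hair_feat @ w_k + hair_pose   # key from the hair, plus pose guidance
    v = hair_feat @ w_v               # value carries the hair appearance
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)   # (n_face, n_hair) attention map
    return attn @ v, attn


# Toy usage with random tensors (all sizes are arbitrary).
rng = np.random.default_rng(0)
d, n_face, n_hair = 8, 16, 12
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
aligned, attn_map = align_cross_attention(
    rng.normal(size=(n_face, d)), rng.normal(size=(n_hair, d)),
    rng.normal(size=(n_face, d)), rng.normal(size=(n_hair, d)),
    w_q, w_k, w_v,
)
print(aligned.shape, attn_map.shape)
```

Each row of `attn_map` is a distribution over reference-hair tokens for one face token; the abstract states that this kind of cross-attention map is also what the method uses at inference to estimate the output's hair regions.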

Paper Structure (23 sections, 8 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Generated results by HairFusion using in-the-wild images. Given a pair of a face image and a reference hairstyle image, our method can generate high-fidelity images. These results show our model's generalizability to diverse face images.
  • Figure 2: Overall pipeline of HairFusion. (a) HairFusion consists of a pre-trained U-Net, a hair encoder, a hair align cross-attention (Align-CA), and a pose encoder. We first preprocess the hair-agnostic image $\mathbf{x}_{agn}$ using a hair mask $\mathbf{m}_{hair}$ and a face outline image $\mathbf{f}_{agn}$. Then, we provide $\mathbf{x}_{agn}$, a hair-agnostic mask $\mathbf{m}_{agn}$, a hair image $\mathbf{x}_{hair}$, and dense pose images $\mathbf{p}_{agn}$, $\mathbf{p}_{hair}$ as inputs to the model. (b) To inject $\mathbf{x}_{hair}$ into $\mathbf{x}_{agn}$, we leverage the Align-CA, which aligns the hair features with the face features via cross-attention. Here, the pose features are added to the query (Q) and the key (K) as additional guidance.
  • Figure 3: Overview of adaptive hair blending. We obtain $\mathbf{m}_{blend}$ using $\mathbf{m}_{ca}$ extracted from CA maps in Align-CA and the source hair mask $\mathbf{m}_{hair}$. $\mathbf{m}_{blend}$ blends the generated hair features with the other features in the source.
  • Figure 4: Qualitative comparison with the diffusion-based baselines.
  • Figure 5: Qualitative comparison with baselines using web-crawled images.
  • ...and 5 more figures
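
The adaptive hair blending described in the Figure 3 caption can be sketched roughly as follows. The caption only states that $\mathbf{m}_{blend}$ is obtained from $\mathbf{m}_{ca}$ and $\mathbf{m}_{hair}$; how the two are combined is an assumption here (a union of the thresholded cross-attention mask and the source hair mask), and the names `blend_mask`, `adaptive_blend`, and the `threshold` value are hypothetical.

```python
import numpy as np


def blend_mask(m_ca, m_hair, threshold=0.5):
    """Combine the cross-attention hair estimate with the source hair mask.

    Assumption: m_ca is a soft map in [0, 1] that is binarized with a
    hypothetical threshold, and the blend mask is the union of both masks,
    so the generated output covers the new hair and the old hair region.
    """
    m_ca_bin = (m_ca >= threshold).astype(np.float32)
    return np.maximum(m_ca_bin, m_hair.astype(np.float32))


def adaptive_blend(generated, source, m_blend):
    """Keep generated content inside the mask and source content outside it."""
    m = m_blend[..., None]            # broadcast the mask over channels
    return m * generated + (1.0 - m) * source


# Toy 4x4 example: the source is all zeros, the generated output all ones.
m_ca = np.zeros((4, 4), dtype=np.float32)
m_ca[:, 3] = 0.9                      # strong attention in the last column
m_ca[0, 0] = 0.2                      # weak attention, below the threshold
m_hair = np.zeros((4, 4), dtype=np.float32)
m_hair[:2, :2] = 1.0                  # original hair in the top-left corner

m_blend = blend_mask(m_ca, m_hair)
blended = adaptive_blend(np.ones((4, 4, 3)), np.zeros((4, 4, 3)), m_blend)
print(m_blend)
```

In this toy case, pixels in the last column and in the top-left corner come from the generated output, while every other pixel is copied from the source, which is how blending protects non-hair regions such as the face, clothing, and background.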