Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth

Zhaoyang Sun, Shengwu Xiong, Yaxiong Chen, Yi Rong

TL;DR

This paper tackles unsupervised makeup transfer without relying on pseudo ground truths (PGTs). It introduces Content-Style Decoupled Makeup Transfer (CSD-MT), which uses frequency decomposition to separate makeup style (the low-frequency component) from content (the high-frequency component), then aligns the two separately using a semantic correspondence map and a SPADE-based makeup renderer. Training adds two new losses: a self-augmented reconstructive loss and a color contrastive loss. The paper reports superior realism (FID/PSNR/SSIM) and transfer quality across multiple datasets, together with strong efficiency. The method also supports makeup control (global and local interpolation, partial transfer, editing) and is robust to pose and expression variations, which makes it practical for real-world makeup transfer. Overall, CSD-MT removes the dependence on PGTs while delivering accurate, controllable makeup transfer with improved efficiency and generalization.

Abstract

The absence of real targets to guide the model training is one of the main problems with the makeup transfer task. Most existing methods tackle this problem by synthesizing pseudo ground truths (PGTs). However, the generated PGTs are often sub-optimal and their imprecision will eventually lead to performance degradation. To alleviate this issue, in this paper, we propose a novel Content-Style Decoupled Makeup Transfer (CSD-MT) method, which works in a purely unsupervised manner and thus eliminates the negative effects of generating PGTs. Specifically, based on the frequency characteristics analysis, we assume that the low-frequency (LF) component of a face image is more associated with its makeup style information, while the high-frequency (HF) component is more related to its content details. This assumption allows CSD-MT to decouple the content and makeup style information in each face image through the frequency decomposition. After that, CSD-MT realizes makeup transfer by maximizing the consistency of these two types of information between the transferred result and input images, respectively. Two newly designed loss functions are also introduced to further improve the transfer performance. Extensive quantitative and qualitative analyses show the effectiveness of our CSD-MT method. Our code is available at https://github.com/Snowfallingplum/CSD-MT.
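
The frequency assumption in the abstract can be illustrated with a small sketch. The excerpt does not say which decomposition CSD-MT uses, so the code below assumes a simple box blur for the low-frequency (LF) part and takes the residual as the high-frequency (HF) part; `box_blur`, `decompose`, `mse`, and the `radius` parameter are hypothetical names chosen for illustration, not the authors' implementation.

```python
import numpy as np

def box_blur(img: np.ndarray, radius: int = 2) -> np.ndarray:
    """Separable box blur with edge padding. img has shape (H, W) or (H, W, C)."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    pad = [(radius, radius), (radius, radius)] + [(0, 0)] * (img.ndim - 2)
    padded = np.pad(img, pad, mode="edge")
    # Blur along the vertical axis, then along the horizontal axis.
    out = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 0, padded)
    out = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="valid"), 1, out)
    return out

def decompose(img: np.ndarray, radius: int = 2):
    """Split an image into LF (blurred) and HF (residual) components.

    Assumption: a box blur stands in for whatever low-pass filter the paper uses.
    """
    low = box_blur(img, radius)
    high = img - low
    return low, high

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean square error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

# Toy example: a uniform color shift changes the LF part but leaves the HF part intact.
rng = np.random.default_rng(0)
face = rng.random((16, 16, 3))      # stand-in for a face image in [0, 1]
recolored = face + 0.1              # crude stand-in for a change in makeup color
low_a, high_a = decompose(face)
low_b, high_b = decompose(recolored)
lf_error = mse(low_a, low_b)        # carries the color change
hf_error = mse(high_a, high_b)      # stays near zero
```

In this toy setting the color shift shows up only in the LF error, which is the intuition behind treating LF as style and HF as content. Real makeup changes are not uniform shifts, so this is only a qualitative illustration.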

Paper Structure

This paper contains 29 sections, 14 equations, 29 figures, 5 tables.

Figures (29)

  • Figure 1: The comparison of different training strategies.
  • Figure 2: The PGTs and transferred results generated by different categories of makeup transfer methods.
  • Figure 3: Visualization of the frequency components decomposed from the source image and the transferred results. The low-frequency components are resized for better visualization. The mean square errors of the different components between source images and transferred results are marked in the lower left corner.
  • Figure 4: Illustration of the proposed CSD-MT framework. (a) Given a source image $x$ and a reference image $y$, the semantic correspondence module first constructs a pixel-wise correlation matrix $M$ between them. Next, by performing face parsing and frequency decomposition, the makeup rendering module $G_{mr}$ obtains the background area $x_{bg}$ and the HF component $x_{h}$, which contain the content information of $x$, as well as the LF component $y_{l}$, which carries the makeup style of $y$. Then, each pixel in the aligned LF component $\hat{y}_{l}$ aggregates the information from the corresponding pixels in $y_{l}$ according to the correlation matrix $M$. Finally, the transferred result $\hat{x}=G_{mr}([x_{bg}, x_{h}],\hat{y}_{l})$ is generated from $x_{bg}$, $x_{h}$ and $\hat{y}_{l}$. In addition, a self-augmented reconstructive loss (b) and a color contrastive loss (c) are introduced to enhance the transfer of the spatial and color information in makeup, respectively.
  • Figure 5: Qualitative comparison with several state-of-the-art methods on different makeup styles. The proposed CSD-MT produces the most precise transferred results with desired makeup information and high-quality content details. Please zoom in for better comparison.
  • ...and 24 more figures
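
The aggregation step in the Figure 4 caption, where each pixel of $\hat{y}_{l}$ gathers information from $y_{l}$ according to $M$, can be sketched as a weighted average. The caption does not specify how $M$ is computed or normalized, so the code below assumes cosine similarity between per-pixel features and a row-wise softmax; `correlation_matrix`, `aggregate`, and `temperature` are hypothetical names, not the paper's API.

```python
import numpy as np

def correlation_matrix(feat_x: np.ndarray, feat_y: np.ndarray) -> np.ndarray:
    """Cosine similarity between every source pixel and every reference pixel.

    feat_x: (N_x, D) source features, feat_y: (N_y, D) reference features.
    Returns M with shape (N_x, N_y). Assumption: cosine similarity is used.
    """
    fx = feat_x / (np.linalg.norm(feat_x, axis=1, keepdims=True) + 1e-8)
    fy = feat_y / (np.linalg.norm(feat_y, axis=1, keepdims=True) + 1e-8)
    return fx @ fy.T

def aggregate(M: np.ndarray, y_l: np.ndarray, temperature: float = 0.01) -> np.ndarray:
    """Warp the reference LF component onto the source layout.

    M: (N_x, N_y) correlation matrix, y_l: (N_y, C) flattened LF pixels.
    Each output pixel is a softmax-weighted average of reference pixels
    (the softmax and its temperature are assumptions made for this sketch).
    """
    logits = M / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ y_l

# Toy example: the source pixels are a permutation of the reference pixels.
perm = np.array([2, 0, 3, 1])
feat_y = np.eye(4)                  # one distinct feature per reference pixel
feat_x = feat_y[perm]               # source pixel i matches reference pixel perm[i]
y_l = np.arange(12, dtype=float).reshape(4, 3)
M = correlation_matrix(feat_x, feat_y)
y_l_hat = aggregate(M, y_l)         # approximately y_l[perm]
```

With a small temperature the weights concentrate on the best-matching reference pixel, so the aligned component approaches a hard rearrangement of $y_{l}$; a larger temperature blends several reference pixels.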