TCDiff: Triple Condition Diffusion Model with 3D Constraints for Stylizing Synthetic Faces

Bernardo Biesseck, Pedro Vidal, Luiz Coelho, Roger Granada, David Menotti

TL;DR

The paper addresses privacy-driven limits on real face data by introducing TCDiff, a Triple Condition Diffusion Model that stylizes synthetic identities with real-world style under 2D and 3D constraints. It combines a 2D style embedding and 3DMM-based shape constraints within a diffusion-based face mixer, trained with a mixed loss $L_T = L_{MSE} + \lambda_{id} L_{ID} + \lambda_{3D} L_{3D}$. Empirically, TCDiff improves intra-class identity consistency and synthetic-data quality, achieving competitive face recognition (FR) performance on standard benchmarks for small-to-moderate class regimes and highlighting a trade-off between consistency and inter-class variance. The work provides code for reproducible synthesis of high-fidelity, identity-consistent synthetic faces and points to future extensions involving pose and expression constraints to broaden applicability.
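As a rough illustration of how the three terms of $L_T$ combine, the sketch below computes the weighted sum in plain Python. This summary does not define $L_{ID}$ or $L_{3D}$, so both forms are assumptions: $L_{ID}$ is taken as one minus the cosine similarity between identity embeddings, and $L_{3D}$ as the mean squared error between 3DMM coefficient vectors. All function and parameter names are hypothetical and are not taken from the authors' code.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def total_loss(eps_pred, eps_true, emb_gen, emb_id,
               shape_gen, shape_id, lambda_id, lambda_3d):
    """L_T = L_MSE + lambda_id * L_ID + lambda_3d * L_3D."""
    # Denoising term: error between predicted and true noise.
    l_mse = mse(eps_pred, eps_true)
    # ASSUMED form: identity term as 1 - cosine similarity of embeddings.
    l_id = 1.0 - cosine_similarity(emb_gen, emb_id)
    # ASSUMED form: 3D term as MSE between 3DMM coefficient vectors.
    l_3d = mse(shape_gen, shape_id)
    return l_mse + lambda_id * l_id + lambda_3d * l_3d
```

The value $\lambda_{3D} = 0.001$ reported in the Figure 4 caption could be passed as `lambda_3d`. No value for $\lambda_{id}$ appears in this summary, so the sketch leaves it as a required argument rather than guessing a default.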

Abstract

A robust face recognition model must be trained using datasets that include a large number of subjects and numerous samples per subject under varying conditions (such as pose, expression, age, noise, and occlusion). Due to ethical and privacy concerns, large-scale real face datasets such as MS1MV3 have been discontinued, and synthetic face generators based on GANs and diffusion models, such as SYNFace, SFace, DigiFace-1M, IDiff-Face, DCFace, and GANDiffFace, have been proposed to meet this demand. Some of these methods can produce high-fidelity realistic faces, but with low intra-class variance, while others generate high-variance faces with low identity consistency. In this paper, we propose a Triple Condition Diffusion Model (TCDiff) to improve face style transfer from real to synthetic faces through 2D and 3D facial constraints, enhancing face identity consistency while keeping the necessary high intra-class variance. Face recognition models trained on 1k, 2k, and 5k classes of our new dataset outperform those trained on state-of-the-art synthetic datasets on real face benchmarks such as LFW, CFP-FP, AgeDB, and BUPT. Our source code is available at: https://github.com/BOVIFOCR/tcdiff.

Paper Structure

This paper contains 7 sections, 11 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the proposed synthetic face mixer TCDiff with 2D and 3D consistency constraints (gray elements). Intermediate style features are extracted from a style image $X_{sty}$ and applied to a synthetic identity image $X_{id}$, generating a new stylized sample $\hat{X}_{0}$.
  • Figure 2: Face samples of real (first row) and synthetic datasets. SYNFace has low identity consistency and low intra-class variance; SFace has low identity consistency; DigiFace-1M, GANDiffFace, and IDiff-Face have high identity consistency but low intra-class variance; DCFace has low identity consistency and high intra-class variance.
  • Figure 3: Simplified illustration of the internal architecture of the U-Net DDPM face mixer $\epsilon_{\theta}$. Intermediate feature maps (blue boxes) are extracted by the encoders from $X_t$, a noisy version of $X_{sty}$, and copied to their corresponding upscale layers. Identity features $E_{id}$, extracted from $X_{id}$, and style features $E_{sty}$, extracted from $X_{sty}$, are fed to attention modules together with the intermediate feature maps, allowing the model to learn how to denoise $X_t$ with features from $X_{id}$ and $X_{sty}$.
  • Figure 4: Stylized synthetic face samples generated with DCFace [10204758] and our proposed TCDiff face mixer. The first row shows the original synthetic $X_{id}$ face and 16 real style faces $X_{sty}$ used to create new samples. In our experiments, $\lambda_{3\text{D}}=0.001$ is the best value to balance intra-class identity consistency and variance.
  • Figure 5: Intra-class cosine similarities of synthetic datasets generated by DCFace [10204758] and our proposed model TCDiff with different values of $\lambda_{3\text{D}}$, computed with ResNet100/ArcFace [Deng2019] trained on MS1MV3 [Deng_2019_ICCV]. The higher the $\lambda_{3\text{D}}$ value, the higher the intra-class identity consistency.
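
The intra-class similarity measurement described for Figure 5 can be sketched as follows. This minimal version assumes the statistic is the mean cosine similarity over all pairs of embeddings belonging to one identity. The exact aggregation used in the paper is not stated here, and the function names are hypothetical. In practice the embeddings would come from a face recognition network such as the ResNet100/ArcFace model named in the caption; short toy vectors stand in for them below.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def intra_class_similarity(embeddings):
    """Mean pairwise cosine similarity within one identity (ASSUMED aggregation)."""
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings per identity")
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings for a single synthetic identity (illustrative values only).
toy_identity = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = intra_class_similarity(toy_identity)
```

A score closer to 1 means the samples of an identity are embedded close together, which is the sense in which the caption links higher $\lambda_{3\text{D}}$ to higher identity consistency.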