StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer

Pin-Yen Chiu, Dai-Jie Wu, Po-Hsun Chu, Chia-Hsuan Hsu, Hsiang-Chen Chiu, Chih-Yu Wang, Jun-Cheng Chen

TL;DR

StyleDiT tackles kinship face synthesis under data scarcity by combining StyleGAN latent control with a diffusion transformer. It introduces Relational Trait Guidance (RTG), which controls the influence of each parent independently and gives a tunable balance between diversity and fidelity, and it extends the framework to predict a partner's face from a child and one parent. Trained on a large synthetic kinship dataset, StyleDiT produces diverse outputs that preserve parental traits and outperforms several baselines in qualitative and quantitative evaluations. The method also shows that synthetic data is a practical way to model complex kinship distributions when real data are limited or low quality. Overall, StyleDiT provides a flexible, high-fidelity, and controllable approach to kinship-aware face synthesis, with potential applications in forensic and genetic research.

Abstract

Kinship face synthesis is a challenging problem due to the scarcity and low quality of the available kinship data. Existing methods often struggle to generate descendants with both high diversity and high fidelity while precisely controlling facial attributes such as age and gender. To address these issues, we propose the Style Latent Diffusion Transformer (StyleDiT), a novel framework that integrates the strengths of StyleGAN with those of diffusion models to generate high-quality and diverse kinship faces. In this framework, the rich facial priors of StyleGAN enable fine-grained attribute control, while our conditional diffusion model, which is well suited to modeling the complex distribution of kinship relationships, samples a StyleGAN latent that is consistent with the kinship relationship of the conditioning images. StyleGAN then decodes the latent into the final face. Additionally, we introduce the Relational Trait Guidance (RTG) mechanism, which enables independent control of each influencing condition, such as each parent's facial image. RTG also enables fine-grained adjustment of the trade-off between diversity and fidelity in the synthesized faces. Furthermore, we extend the application to an unexplored domain: predicting a partner's face from a child's image and one parent's image within the same framework. Extensive experiments demonstrate that StyleDiT outperforms existing methods by striking an excellent balance between diversity and fidelity in the generated kinship faces.
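
The abstract does not spell out the RTG formula, so the following is only an illustrative sketch of how per-condition guidance scales can work in general. It assumes a compositional, classifier-free-guidance-style combination in which each condition's contribution is scaled separately. The names `toy_denoiser`, `guided_prediction`, `s1`, and `s2` are hypothetical and not taken from the paper, and the toy denoiser stands in for the actual transformer.

```python
from typing import Callable, List, Optional

Vector = List[float]


def toy_denoiser(x: Vector,
                 c1: Optional[Vector] = None,
                 c2: Optional[Vector] = None) -> Vector:
    """Hypothetical stand-in for a conditional denoiser.

    Passing None for a condition mimics dropping it, as is done when
    training a model for classifier-free guidance.
    """
    out = list(x)
    for cond in (c1, c2):
        if cond is not None:
            # Each supplied condition shifts the prediction toward itself.
            out = [o - 0.5 * c for o, c in zip(out, cond)]
    return out


def guided_prediction(model: Callable[..., Vector],
                      x: Vector, c1: Vector, c2: Vector,
                      s1: float, s2: float) -> Vector:
    """Combine three model evaluations with one scale per condition.

    Assumed form (not the paper's stated equation):
        e = e_uncond + s1 * (e_c1 - e_uncond) + s2 * (e_c2 - e_uncond)
    """
    e_uncond = model(x)        # both conditions dropped
    e_c1 = model(x, c1=c1)     # only the first condition (e.g. one parent)
    e_c2 = model(x, c2=c2)     # only the second condition (e.g. the other parent)
    return [u + s1 * (a - u) + s2 * (b - u)
            for u, a, b in zip(e_uncond, e_c1, e_c2)]


if __name__ == "__main__":
    x = [1.0, 2.0]
    father, mother = [2.0, 0.0], [0.0, 4.0]
    for scales in [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]:
        print(scales, guided_prediction(toy_denoiser, x, father, mother, *scales))
```

Under this assumed form, setting both scales to zero recovers the unconditional prediction, and raising one scale strengthens only that condition's influence, which matches the kind of independent, tunable control the abstract attributes to RTG.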

Paper Structure

This paper contains 39 sections, 12 equations, 12 figures, 12 tables.

Figures (12)

  • Figure 1: The overview of our image encoder. The input image is first encoded into a latent code using e4e. Then, the latent code is adjusted for age, gender, and pose attributes using an attribute block, and finally projected into the $S$ space through affine transformer layers.
  • Figure 2: The overview of the proposed framework. For both child and partner prediction tasks, input images are first encoded using an image encoder. The encoded style latents, $S_{in_1}$ and $S_{in_2}$, serve as conditions during the diffusion process. The sampled noisy latent undergoes processing through multiple Denoising Transformer blocks, resulting in the predicted face's latent $S_{out}$. Finally, StyleGAN2 decodes this predicted latent to generate a high-fidelity kinship face. The lock icon indicates that the block is frozen during training.
  • Figure 2: The comparison of controlling age, gender, and frontalization while preserving skin tone. For the first and second rows, our image encoder was used, and for the third and fourth rows, the Kinstyle image encoder was used, with all images generated by StyleGAN2.
  • Figure 3: The effect of different guidance scales during inference. Progressing from left to right, each image results from varying pairs of guidance scales, with higher scales producing synthesized images that more closely resemble the specified conditions.
  • Figure 3: The visual comparison showcases images from the FIW dataset, the FIW dataset after applying super-resolution (Zhou et al., 2022), and our simulated dataset. In each block, rows represent families, with images of the father, mother, and child displayed from left to right. Red bounding boxes highlight artifacts in the FIW dataset after applying super-resolution.
  • ...and 7 more figures