StyleDiT: A Unified Framework for Diverse Child and Partner Faces Synthesis with Style Latent Diffusion Transformer
Pin-Yen Chiu, Dai-Jie Wu, Po-Hsun Chu, Chia-Hsuan Hsu, Hsiang-Chen Chiu, Chih-Yu Wang, Jun-Cheng Chen
TL;DR
StyleDiT tackles kinship face synthesis under data scarcity by combining StyleGAN latent control with a diffusion transformer. It introduces Relational Trait Guidance (RTG), which controls each parent's influence independently and gives a tunable balance between diversity and fidelity, and it extends the framework to predict a partner's face from a child and one parent. Trained on a large synthetic kinship dataset, StyleDiT generates diverse outputs that preserve parental traits and outperforms several baselines in qualitative and quantitative evaluations. The method also highlights the practicality of synthetic data for modeling complex kinship distributions when real data are limited or of low quality. Overall, StyleDiT provides a flexible, high-fidelity, and controllable approach to kinship-aware face synthesis, with potential applications in forensic and genetic research contexts.
Abstract
Kinship face synthesis is a challenging problem due to the scarcity and low quality of the available kinship data. Existing methods often struggle to generate descendants with both high diversity and high fidelity while precisely controlling facial attributes such as age and gender. To address these issues, we propose the Style Latent Diffusion Transformer (StyleDiT), a novel framework that integrates the strengths of StyleGAN with those of diffusion models to generate high-quality and diverse kinship faces. In this framework, the rich facial priors of StyleGAN enable fine-grained attribute control, while our conditional diffusion model, which is well suited to modeling the complex distribution of kinship relationships, samples a StyleGAN latent aligned with the kinship relationship of the conditioning images. StyleGAN then decodes the sampled latent into the final face. Additionally, we introduce the Relational Trait Guidance (RTG) mechanism, which enables independent control of each influencing condition, such as each parent's facial image. RTG also enables fine-grained adjustment of the trade-off between diversity and fidelity in the synthesized faces. Furthermore, we extend the application to an unexplored domain: predicting a partner's facial image from a child's image and one parent's image within the same framework. Extensive experiments demonstrate that StyleDiT outperforms existing methods by striking an excellent balance between generating diverse and high-fidelity kinship faces.
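The abstract does not give the RTG formula. As an illustration only, the sketch below shows one common way to give each condition its own guidance strength: a multi-condition variant of classifier-free guidance. This is an assumption and not necessarily the authors' formulation. The function name `guided_noise`, the weights `w_father` and `w_mother`, and the toy noise vectors are all hypothetical.

```python
from typing import List, Sequence


def guided_noise(
    eps_uncond: Sequence[float],
    eps_father: Sequence[float],
    eps_mother: Sequence[float],
    w_father: float,
    w_mother: float,
) -> List[float]:
    """Combine noise predictions with one guidance weight per parent.

    Assumed form (multi-condition classifier-free guidance, not taken
    from the paper):
        eps = eps_uncond
              + w_father * (eps_father - eps_uncond)
              + w_mother * (eps_mother - eps_uncond)
    """
    return [
        u + w_father * (f - u) + w_mother * (m - u)
        for u, f, m in zip(eps_uncond, eps_father, eps_mother)
    ]


# Toy stand-ins for the denoiser's outputs under each condition.
eps_uncond = [0.0, 0.0, 0.0, 0.0]
eps_father = [1.0, 1.0, 1.0, 1.0]
eps_mother = [-1.0, -1.0, -1.0, -1.0]

# Equal weights: both parents pull with the same strength.
balanced = guided_noise(eps_uncond, eps_father, eps_mother, 1.0, 1.0)

# Larger father weight: the result moves toward the father-conditioned prediction.
father_heavy = guided_noise(eps_uncond, eps_father, eps_mother, 2.0, 0.5)
```

In a sketch like this, raising one weight pulls the sample toward that parent without changing the other parent's term. Scaling both weights down leaves more of the unconditional prediction, which would typically mean more diversity at the cost of fidelity.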
