Bringing Diversity from Diffusion Models to Semantic-Guided Face Asset Generation

Yunxuan Cai, Sitao Xiang, Zongjian Li, Haiwei Chen, Yajie Zhao

TL;DR

The paper tackles the challenge of creating diverse, semantically controllable 3D face assets with realistic textures suitable for PBR rendering. It introduces a diffusion-driven data generation pipeline to build a large UV-texture/head-geometry dataset, a two-stage GAN-based generator to produce geometry and albedo conditioned on demographic attributes, and a texture-normalization mechanism to convert diffusion-derived textures into clean albedo maps. Asset refinement further adds high-frequency detail, specular/displacement maps, and secondary components, enabling production-ready assets with inversion and editing capabilities. An interactive web interface demonstrates practical usability, and extensive experiments show improved semantic control, texture quality, and faster generation times compared to prior methods. The approach offers a scalable path for diverse synthetic avatar creation in VFX, gaming, and data generation, while acknowledging limitations around geometry diversity, UV completion, and diffusion biases.

Abstract

Digital modeling and reconstruction of human faces serve various applications. However, their availability is often hindered by the requirements of data capturing devices, manual labor, and suitable actors. This situation restricts the diversity, expressiveness, and control over the resulting models. This work aims to demonstrate that a semantically controllable generative network can provide enhanced control over the digital face modeling process. To enhance diversity beyond the limited human faces scanned in a controlled setting, we introduce a novel data generation pipeline that creates a high-quality 3D face database using a pre-trained diffusion model. Our proposed normalization module converts synthesized data from the diffusion model into high-quality scanned data. Using the 44,000 face models we obtained, we further developed an efficient GAN-based generator. This generator accepts semantic attributes as input and generates geometry and albedo. It also allows continuous post-editing of attributes in the latent space. Our asset refinement component subsequently creates physically-based facial assets. We introduce a comprehensive system designed for creating and editing high-quality face assets. Our proposed model has undergone extensive experiments, comparisons, and evaluation. We also integrate everything into a web-based interactive tool. We aim to make this tool publicly available with the release of the paper.

Paper Structure

This paper contains 32 sections, 9 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Overview of the proposed dataset preparation method: Portraits are initially synthesized using a latent diffusion model that is conditioned on semantic facial attributes and frontal-view normal maps. A pre-trained face reconstruction model is then applied to these portraits to extract modified geometries in the form of position maps. Textures are completed by blending the projected portraits with physically-based rendered texture maps from the scanning database.
  • Figure 2: Two types of 3D human face data. (a) From left to right: the template model used for registering all data resources, sample of Light Stage data, and Triplegangers data. (b) Samples of semantic attributes and the resulting 2D portraits from LDM.
  • Figure 3: Examples of training data. From left to right in each example are: the input attribute (semantic and skin tone guide), portrait, map before normalization, normalized texture map, and images rendered without and with the post-processing (see the normalization section).
  • Figure 4: Diagram of the normalization network.
  • Figure 5: Diagram of the normalization network training pipeline.
  • ...and 13 more figures
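
The texture-completion step described in the Figure 1 caption (blending a projected portrait with a rendered texture map) can be illustrated as a per-texel alpha blend in UV space. This is only a minimal sketch under stated assumptions: the function name `blend_uv_textures`, the `visibility` mask, and the linear blend itself are hypothetical choices made for illustration, and the paper's actual blending procedure may differ.

```python
import numpy as np


def blend_uv_textures(projected, rendered, visibility):
    """Alpha-blend two UV texture maps (hypothetical helper, not the paper's code).

    projected:  (H, W, 3) float array, portrait pixels projected into UV space.
    rendered:   (H, W, 3) float array, rendered texture map used to fill gaps.
    visibility: (H, W) float array in [0, 1]; 1 where the projection is trusted
                (assumed to be available, e.g. from a visibility test).
    """
    projected = np.asarray(projected, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    alpha = np.clip(np.asarray(visibility, dtype=np.float64), 0.0, 1.0)

    if projected.shape != rendered.shape:
        raise ValueError("projected and rendered textures must share a shape")
    if alpha.shape != projected.shape[:2]:
        raise ValueError("visibility must match the texture height and width")

    # Add a channel axis so the mask broadcasts over the three color channels.
    alpha = alpha[..., None]
    return alpha * projected + (1.0 - alpha) * rendered


# Tiny usage example: a 2x2 texture with a different blend weight per texel.
projected = np.ones((2, 2, 3))
rendered = np.zeros((2, 2, 3))
visibility = np.array([[1.0, 0.5], [0.0, 0.25]])
blended = blend_uv_textures(projected, rendered, visibility)
```

In this toy input each output texel equals its visibility weight, which makes it easy to check that trusted regions keep the projected portrait while occluded regions fall back to the rendered map.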